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Chapter  1 
Introduction 


Suppose  one  wanted  to  construct  a  building-sized  parallel  computer.  What 
are  the  technical  obstacles  preventing  one  from  doing  so?  The  list  is  made 
up  of  scalability  concerns,  issues  which  become  increasingly  important  as  the 
machine  gets  larger.  One  key  concern  is  the  reliability,  which  decreases  with 
the  size  of  the  computer  [10]. 

The  reliability  problem  can  be  combatted  by  introducing  a  level  of  fault 
tolerance  into  the  computer.  There  are  two  main  methods  of  providing  fault 
tolerance.  The  first  is  static  reconfiguration,  where  the  state  of  the  machine 
is  checkpointed  regularly  to  stable  storage.  Faults  cause  the  machine  to  halt, 
a  reconfiguration  is  performed  where  spares  are  brought  in  or  portions  of  the 
machine  are  disabled,  and  the  state  of  the  computation  is  restored  from  the 
last  checkpoint.  Unfortunately,  as  the  machine  gets  very  large  and  failures 
quite  frequent,  the  machine  spends  most  of  its  time  checkpointing.  It  may 
even  reach  the  point  where  the  machine  is  unable  to  reliably  checkpoint  itself. 

The  second  method  of  providing  fault  tolerance  is  dynamic  reconfigura¬ 
tion,  where  state  is  saved  and  restored  in  local  areas  of  the  machine  through 
replication.  This  allows  the  machine  to  continue  operation  while  a  fault  is  oc¬ 
curring.  This  replication  is  not  without  costs.  For  example,  one  may  replicate 
the  entire  communication  fabric  of  the  computer  to  ensure  reliable  commu¬ 
nication  between  processors,  which  loses  by  factor  of  two  in  performance  to 
non-reliable  machines  with  equivalent  wiring  complexity.  Alternatively,  one 
may  use  end-to-end  protocols  for  communications,  which  require  storage  in 
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the  processors  to  possibly  repeat  lost  messages  and  a  method  of  eliminating 
duplicate  messages  upon  receipt. 

We  propose  in  this  thesis  a  third  method  of  achieving  reliable  communica¬ 
tion  between  processors.  This  method  relies  on  the  use  of  replicated  storage 
in  the  communication  network  and  a  new  protocol,  the  Unique  Token  Proto¬ 
col.  The  protocol  requires  a  constant  amount  of  storage  per  communication 
element  and  hence  is  no  barrier  to  scaling.  It  consumes  no  additional  wire 
bandwidth.  Lastly,  the  overhead  for  elimination  of  duplicate  messages  is 
only  incurred  when  failures  actually  happen  and  thus  the  overhead  becomes 
essentially  negligible. 

We  will  also  show  the  practicality  of  the  unique  token  protocol  by  con¬ 
structing  a  register  transfer  level  description  of  a  communication  element. 
This  model  is  fairly  complex,  as  one  needs  an  adaptive  router  to  go  around 
possible  network  faults,  and  the  router  must  be  deadlock  free.  As  a  result  of 
the  base  level  of  complexity,  the  model  is  kept  as  simple  as  possible  to  avoid 
obscuring  the  implementation  of  the  reliability  protocol. 

This  thesis  is  organized  into  seven  main  sections.  Chapter  2  details  the 
subset  of  parallel  computers  for  which  the  reliable  protocol  was  first  designed. 
Properties  of  the  communication  networks  which  impact  protocols  are  given. 
A  simple  fault  model  for  those  networks  is  presented.  Chapter  3  explores 
the  scalability  of  end-to-end  protocols  to  very  large  parallel  computers.  It 
demonstrates  that  the  storage  requirements  for  such  protocols  do  not  scale 
linearly  with  the  number  of  processors.  Chapter  4  presents  the  unique  token 
protocol  which  uses  replication  in  the  network.  This  protocol  is  shown  to  use 
storage  which  scales  linearly  with  the  size  of  the  machine.  Chapter  5  presents 
an  overview  of  a  router  designed  to  support  the  new  protocol.  Chapter  6  gives 
a  detailed  presentation  of  the  microarchitecture.  Chapter  7  describes  the 
design  validation  which  was  performed  using  simulations.  Finally,  chapter  8 
presents  conclusions. 
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Chapter  2 

Communication  Networks 


There  exists  a  large  variety  of  communication  networks  for  parallel  comput¬ 
ers.  The  major  categories  include  butterflies  [3],  shuffle  exchanges,  hyper¬ 
cubes  [1],  and  meshes  [6].  Further  permutations  exist  when  one  accounts  for 
storage  in  the  network,  store- and-forward  versus  cut-through  [12]  and  worm- 
hole,  congestion  management  policies,  etc.  All  of  these  have  an  impact  on 
the  protocols  used  to  reliably  transport  data.  In  order  to  focus  on  the  proto¬ 
col  and  not  the  foibles  of  the  network,  we  restrict  ourselves  to  certain  types 
of  networks  which  have  uniform  properties.  These  types  of  networks  are  not 
readily  described  by  the  terms  “3D  mesh”  or  “hypercube”.  Instead,  we  will 
define  what  they  are  in  terms  of  their  topologies  and  congestion  management 
policies. 

We  are  also  contrasting  reliable  communication  protocols,  which  implies 
that  portions  of  the  network  will  sometimes  fail.  At  the  end  of  this  chapter, 
we  will  describe  the  network  fault  model  used. 


2.1  Topology 

An  interconnection  network  is  a  strongly-connected  directed  graph,  I  =  G{N,  C). 
The  vertices  of  /  r.re  a  set  of  nodes,  N.  N  itself  is  the  union  of  the  set  of 
processing  nodes  P  and  the  set  of  switching  nodes  5.  The  edges  are  a  set  of 
channels,  C  =  Cinjtct  U  Cuantport  u  Ctxtract,  where: 
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Cinjeet  ^  P  X  S 
Ctrantport  ^  S  X  S 

(^extract  ^  S  X  P 

Each  channel  is  unidirectional  and  carries  data  from  a  source  node  to 
a  destination  node.  A  bidirectional  network  is  one  where  (nl,n2)  €  » 

(n2,  nl)  6  C.  Since  symmetric  communication  paths  between  processors  sim¬ 
plify  the  analysis  of  things  such  as  round  trip  times,  the  networks  considered 
here  are  constrained  to  be  bidirectional. 

The  network  must  be  able  to  maintain  connectivity  in  the  event  of  a  fault. 
The  network ’s  thus  required  to  have  at  least  two  disjoint  paths  between  smy 
two  switching  nodes.  The  paths  must  be  disjoint  in  terms  of  the  switching 
elements  appearing  on  each  path. 

Many  topologys  satisfy  these  requirements.  They  include  toruses,  hy¬ 
percubes,  meshes,  and  express  cubes  [8].  Later,  examples  of  the  protocols 
will  be  given  using  two-dimensional  mesh  networks.  This  is  to  facilitate  the 
presentation  and  should  not  be  construed  as  a  binding  requirement. 


2.2  Congestion  Management 

Using  the  unique  token  protocol,  the  network  promises  to  always  deliver  the 
packet.  The  network  is  not  allowed  to  drop  packets,  which  implies  that  the 
network  must  have  storage  for  packets  in  the  event  of  congestion.  A  switching 
node  may  provide  storage  for  an  entire  packet,  as  in  the  case  of  store-and- 
forward  networks.  Alternatively,  only  portions  of  a  packet  may  be  stored  at 
each  switching  node,  as  in  virtual  cut-through  and  worm-hole  routing.  The 
choice  of  buffering  strategy  does  not  affect  the  unique  token  protocol. 

The  network  must  have  some  facility  for  routing  around  failed  channels 
and/or  switching  nodes.  For  example,  dimension-ordered  routing  in  a  3D- 
mesh  network  is  not  tolerant  of  any  minor  irregularities  in  the  topology. 
Adaptive  routing  algorithms  such  as  those  using  virtual  channels  or  chaotic 
methods  are  required  [4,  5,  14,  11].  Further,  such  algorithms  must  either  be 
deadlock  free  or  provide  a  method  other  than  packet  dropping  to  break  the 
deadlock  [9]. 
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2.3 


Fault  Model 


The  fault  model  used  throughout  this  thesis  is  a  simple  one.  We  assume  that 
failures  occur  at  single  points.  We  further  require  that  the  time  between 
failures  is  sufficiently  large  to  allow  the  network  to  heal. 

For  wires,  the  fault  model  is  a  stuck-at  fault.  When  a  wire  fault  is  de¬ 
tected,  the  fault  is  assumed  to  be  permanent,  not  transient.  This  is  done 
because  the  process  of  deciding  whether  or  not  a  wire  failure  is  transient 
is  technology  dependent.  For  example,  a  wire  failure  between  two  adjacent 
chips  on  a  PC  board  is  usually  taken  as  indication  that  something  is  physi¬ 
cally  wrong  with  the  board.  If  the  wire  is  actually  a  cross  country  link  using 
commercial  phone  lines,  one  is  more  apt  to  blame  noise  for  the  failure. 

Nodes  are  required  to  operate  in  a  fail-stop  faishion  for  much  the  same 
reason,  in  that  there  does  not  exist  a  good  technology  or  implementation  in¬ 
dependent  model  of  node  failures.  The  unique  token  protocol  is  not  designed 
to  be  robust  in  the  presence  of  Byzantine  failures.  If  one  did  not  impose 
the  fail-stop  requirement,  one  would  always  able  to  contruct  some  failure 
mechanism  which  causes  a  particular  implementation  of  the  protocol  to  also 
fail. 

Lastly,  we  require  that  failures  not  occur  too  closely  to  one  amother.  This 
is  to  avoid  multiple  single  point  failures  from  appearing  to  be  multiple  point 
failures. 
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Chapter  3 

Reliable  End-to-End  Protocols 


We  have  already  seen  that  large  scale  parallel  computers  require  reliable 
interprocessor  communication.  In  this  chapter,  we  will  look  at  the  problem  a 
reliable  protocol  tries  to  solve  and  then  explore  several  end-to-end  protocols. 
We  will  see  that  while  these  protocols  do  solve  the  reliability  problem,  they 
do  not  scale  linearly  with  the  size  of  the  machine.  We  measure  machine  size 
by  the  number  of  processors  and  represent  the  size  of  the  machine  by  the 
letter  n. 


3.1  Characteristics  of  the  Protocols 

An  entity  on  node  A  wishes  to  send  a  message  to  an  entity  on  node  B.  We 
wish  that  the  entity  on  node  B  processes  the  message  exactly  once.  A  subtle 
point  is  that  is  perfectly  acceptable  for  node  B  to  receive  many  copies  of  the 
same  message,  as  long  they  are  proc^sed  exactly  once.  ^ 

If  a  portion  of  the  network  which  holds  the  message  as  it  transits  from 
node  A  to  node  B  could  fail,  then  the  message  needs  to  be  replicated  some¬ 
where.  It  could  be  replicated  in  the  network,  at  node  A,  or  both.  Where  the 

‘This  may  be  loosened  even  further  to  allow  the  multiple  processing  of  the  same  mes¬ 
sage,  as  long  as  the  processing  is  idempotent.  We  believe  that  most  messages  in  parallel 
computers  result  in  actions  which  are  not  idempotent,  so  we  are  unable  to  take  advantage 
of  idempotence. 
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the  replication  occurs  is  a  key  property  of  the  protocols. 

With  multiple  copies  of  the  same  message  and  imperfect  communication 
between  any  two  copies,  it  is  possible  to  deliver  duplicates  of  a  message  to 
node  B.  The  destination  must  now  discard  all  duplicates  and  process  the 
message  exactly  once.  Thus,  the  other  key  property  of  a  protocol  is  how  the 
destination  detects  and  discards  redundant  messages. 


3.2  Alternating  Bit  Protocol 

The  alternating  bit  protocol  [2,  16,  19]  relies  on  a  FIFO  communication 
channel  between  two  nodes,  A  and  B.  The  channel  is  bidirectional.  The 
alternating  bit  is  a  reusable  sequence  number  ranging  over  0  and  1. 

The  protocol  begins  with  node  A  sending  a  message  to  node  B.  The  mes¬ 
sage  contains  the  sequence  number  0.  Node  A  repeatedly  sends  the  message 
until  it  receives  an  acknowledgment  having  a  sequence  number  of  0.  Node  A 
then  shifts  to  sequence  number  1  and  starts  its  next  message.  Node  B  sends 
acknowledgments  each  time  it  successfully  receives  a  message.  Node  B  only 
processes  messages  whose  sequence  number  differed  from  the  last  sequence 
number  successfully  received. 

This  protocol  cannot  readily  be  used  in  a  parallel  processor  for  the  fol¬ 
lowing  reasons; 

1.  It  does  not  allow  multiple,  unacknowledged  messages  to  be  in  flight 
between  the  two  nodes.  This  is  a  severe  performance  bottleneck. 

2.  It  relies  on  FIFO  channels.  This  is  certainly  not  the  case  in  a  network 
with  adaptive  routing. 

3.  The  sequence  number  must  be  maintained  between  each  pair  of  nodes. 
This  scales  as  O(n^),  which  is  not  acceptable  for  large  machines. 
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3.3  Time  Stamps 


Time  stamps  are  a  way  of  augmenting  the  alternating  bit  protocol  to  allow 
multiple  messages  in  flight  and  account  for  out-of-order  message  delivery. 
When  node  A  first  decides  to  send  a  message  to  node  B,  it  stamps  it  with 
the  current  time.  The  message  is  then  sent  to  node  B. 

When  node  B  receives  the  message,  it  checks  to  see  if  this  message  is 
newer  than  all  previous  messages  from  node  A.  If  it  is,  node  B  processes  the 
message  and  returns  a  positive  acknowledgment  to  node  A.  If  the  message 
was  not  newer,  a  negative  acknowledgment  is  returned. 

If  node  A  receives  a  positive  acknowledgment,  it  is  done  with  that  mes¬ 
sage.  If  the  node  A  receives  a  negative  acknowledgment  for  a  message  it 
believes  to  have  been  sent  successfully,  the  negative  acknowledgment  is  ig¬ 
nored.  If  it  receives  a  negative  acknowledgment  for  a  message  which  it  be¬ 
lieves  hasn’t  been  successful,  it  updates  the  time  stamp  on  the  message  and 
resends  the  message.  Lastly,  node  A  may  time  out  and  simply  resend  the 
message. 

This  protocol  can  be  used  in  a  parallel  processor,  but  it  still  suffers  from 
the  following  problems: 

1.  When  messages  arrive  at  the  destination  out  of  order,  they  will  need 
to  be  retransmitted. 

2.  The  sequence  number  must  be  maintsdned  between  each  pair  of  nodes. 
This  scales  as  O(n^),  which  is  not  acceptable  for  large  machines. 

3.  Choosing  the  size  of  the  timeout  window  is  difficult  when  the  network 
has  large  variations  in  latency. 

3.4  Windowed  Time  Stamps 

This  protocol  is  basically  the  same  as  the  time  stamp  protocol,  except  that 
the  destination  remembers  the  n  newest  time  stamps  it  has  seen  from  a 
source.  This  allows  the  destination  to  accept  messages  which  arrive  slightly 
out  of  order.  The  protocol  only  suffers  from: 
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1.  The  size  of  time  stamp  tables  in  the  destination  do  not  scale. 

2.  Choosing  the  size  of  the  timeout  window  is  difficult  when  the  network 
has  large  variations  in  latency. 


3.5  Summary 

We  have  explored  three  different  end-to-end  protocols.  We  have  seen  that  the 
expected  space  requirements  of  the  protocols  are  not  constant  per  node,  ow¬ 
ing  to  the  n?  growth  in  the  destination  time  stamp  tables  and  the  y/n  growth 
of  the  source  buffer  pools^.  Further,  determining  “good”  source  buffer-pool 
sizes  and  time-out  intervals  is  largely  empirical.  These  parameters  may  not 
yield  satisfactory  performance  when  new  programs  offer  traffic  not  in  accor¬ 
dance  with  the  simulation  model.  We  are  thus  motivated  to  look  for  new 
protocols,  which  is  the  subject  of  the  next  chapter. 


derivation  is  contained  in  Appendix  A 
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Chapter  4 

The  Unique  Token  Protocol 


We  saw  that  the  end-to-end  protocols  replicate  packets  in  the  source,  requir¬ 
ing  storage  in  amounts  that  do  not  scale  linearly  with  the  size  of  the  machine. 
The  destination  requires  a  large  table  to  eliminate  duplicates  which  also  does 
not  scale.  As  alluded  to  in  the  chapter  on  end-to-end  protocols,  there  aire 
several  places  in  a  system  where  packets  can  be  replicated  and  many  ways 
for  the  destination  to  eliminate  duplicates.  In  this  chapter,  we  propose  an  al¬ 
ternative  reliable  protocol  which  uses  the  network  to  replicate  packets.  This 
protocol,  the  unique  token  protocol,  will  be  shown  to  have  both  satisfactory 
performance  and  implementation  costs. 


4.1  Overview  of  the  Protocol 

We  observe  that  many  networks  already  provide  storage  for  packets  in  flight. 
We  propose  to  use  that  storage  for  maintaining  duplicate  copies  of  a  packet 
in  the  network,  rather  than  keeping  copies  at  the  source.  By  distributing  the 
storage  throughout  the  network,  we  will  be  able  to  bound  the  storage  to  a 
constant  amount  per  node. 

We  also  noticed  that  the  end-to-end  protocols  required  large  tables  to 
handle  possible  packet  duplication,  even  though  most  the  time  duplicates 
were  not  sent.  It  would  be  desirable  if,  when  the  network  delivers  multiple 
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copies^  of  a  packet,  all  copies  of  the  packet  were  marked  as  being  copies. 
Thus,  when  a  packet  arrived  marked  as  unique,  the  destination  node  would 
not  have  to  worry  about  duplicate  elimination.  Only  in  the  rare  case  of  actual 
errors  would  the  duplicate  elimination  machinery  be  used. 

We  begin  the  presentation  of  the  unique  token  protocol  with  a  model  of 
a  source  node  sending  a  packet  to  a  destination  node,  where  the  packet  is 
buffered  and  forwarded  at  several  intervening  switching  nodes  along  the  way. 
The  buffering  and  forwarding  process  needs  to  ensure  that  at  least  two  copies 
of  the  packet  exist  along  the  path  from  source  to  destination  at  all  times. 
This  is  done  by  first  copying  the  packet  forward  one  node,  then  allowing 
the  release  of  the  storage  in  the  rearmost  node.  This  is  shown  in  figure  4.1. 
Note  that  any  node  along  the  path  only  receives  one  copy  of  the  packet,  even 
though  multiple  copies  are  kept. 

When  the  packet  first  enters  the  network,  a  token  is  sent  along  behind 
the  packet.  We  will  use  the  token  to  determine  when  the  packet  has  been 
duplicated  in  the  network  by  preserving  an  invariant,  that  no  copies  of  the 
packet  exist  in  the  path  behind  the  token.  Thus,  if  a  single  path  is  used 
to  connect  source  and  destination,  the  arrival  of  a  token  at  the  destination 
implies  that  the  packet  has  been  delivered  exactly  once. 

When  a  portion  of  the  network  fails,  communication  between  the  advance 
and  rear  copies  of  the  packet  may  be  severed.  Neither  copy  knows  the  fate 
of  the  other,  so  they  both  must  make  their  way  to  the  destination.  When 
they  arrive  at  the  destination,  both  packets  must  be  marked  in  such  a  way 
as  to  cause  the  destination  to  look  for  duplicates.  The  protocol  does  this  by 
establishing  two  types  of  tokens:  unique  or  replica.  When  the  token  is  first 
sent  along  behind  the  packet,  the  type  is  unique.  If  the  network  ever  has  to 
use  multiple  paths  while  forwarding  a  pacl%t,  the  token  is  changed  to  type 
replica  for  all  copies  of  the  packet. 

The  conversion  of  tokens  of  type  unique  to  tokens  of  type  replica  happens 
when  the  network  fails.  The  rear  copy  of  the  packet  is  still  on  the  path 
leading  from  the  source  and  will  have  a  token  coming  up  from  behind.  The 
rear  copy  simply  chooses  a  new  forward  path  to  the  destination  and  copies 
itself  forward.  When  the  token  arrives  at  the  point  of  reroute,  it  is  converted 

^  “copies”  includes  the  original,  plus  all  replicas  made  in  the  event  of  the  loss  of  the 
original. 
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Figure  4.1:  An  illustration  of  packets  and  the  token  moving  through  the  network, 
(a)  Switching  node  A  has  the  token  and  a  copy  of  the  packet,  switching  node  B  has 
a  copy  of  the  packet,  and  switching  node  C  is  empty.  The  token-passing  portion 
of  the  protocol  begins  with  the  nodes  in  this  state,  (b)  Switching  node  B  then 
copies  its  packet  to  node  C.  (c)  When  node  B  has  successfully  copied  its  packet 
to  node  C,  node  B  sends  an  acknowledgment  to  node  A.  Node  A  now  knows  that 
the  packet  is  safely  in  two  other  places,  and  is  able  to  delete  its  copy,  (d)  The  last 
remaining  step  is  for  node  A  to  pass  the  token  to  node  B. 

to  replica.  The  advance  copy  lost  the  path  back  to  the  source  and  will  not 
see  a  token.  The  advance  copy  generates  a  new  token  of  type  replica  and 
continues  on  to  the  destination.  We  rely  on  ptMdcets  containing  some  form  of 
unique  identifier  to  allow  the  destination  to  eliminate  the  duplicates. 
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4.2  Protocol  State  Diagram 


The  full  protocol  used  by  a  single  node  for  moving  packets  and  tokens  is 
shown  in  figure  4.2.  There  are  12  states  in  total.  States  1-6  are  used  under 
normal  operation,  states  7-12  are  used  when  errors  occur.  In  the  detailed 
description  that  follows,  there  are  actually  three  nodes  involved.  There  is 
the  node  whose  state  transitions  are  being  displayed  (simply  the  node),  the 
node  closer  to  the  source  (the  upstream  node),  and  the  node  closer  to  the 
destination  (the  downstream  node). 

The  inputs  to  the  state  machine  are  tokens,  packets,  and  acknowledg¬ 
ments.  Tokens  and  packets  are  placed  in  one  first-in-6rst-out  (FIFO)  queue, 
acknowledgments  are  placed  in  a  second.  This  allows  the  state  machine  to 
disregard  the  order  of  arrival  of  these  events.  In  particular,  this  simplifies  the 
handling  of  the  arrival  of  a  second  packet  while  the  node  is  still  forwarding 
a  token. 

(1)  Idle  The  idle  state  is  left  whenever  a  packet  arrives  from  the  upstream 

node. 

(2)  Forward  Packet  The  node  determines  a  route  for  the  packet  and  for¬ 

wards  the  packet  to  the  downstream  node.  The  node  waits  a  short, 
fixed  period  of  time  to  allow  the  downstream  node  to  signal  the  success 
or  failure  of  the  transfer. 

(3)  Send  Ack  The  node  sends  an  acknowledgment  to  the  upstream  node, 

indicating  that  three  copies  of  the  packet  now  exist  and  that  it  is  safe 
for  the  upstream  node  to  erase  its  copy. 

(4)  Wait  for  Ack  The  node  now  waits  for  the  downstream  node  to  tell  it 

that  its  copy  of  the  packet  may  be  erased. 

(5)  Wait  for  Token  The  node  erases  its  copy  of  the  packets  and  waits  for 

the  token  from  upstream. 

(6)  Forward  Token  The  token  is  sent  downstream  and  the  automaton  re¬ 

turns  to  idle. 


19 


(1) 


Figure  4.2:  State  Diagram  for  the  Reliable  Protocol 


(7)  Reroute  The  packet  is  rerouted  and  forwarded.  This  process  repeats 

until  succeeding. 

(8)  Send  Ack  The  node  sends  an  acknowledgment  to  the  upstream  node, 
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indicating  that  three  copies  of  the  packet  now  exist  and  that  it  is  safe 
for  the  upstream  node  to  erase  its  copy.  Since  this  state  is  entered 
under  error  conditions,  the  success  or  failure  of  the  acknowledgment 
transmission  does  not  matter. 

(9)  Wait  for  Ack  The  node  now  waits  for  the  downstream  node  to  tell  it 

that  its  copy  of  the  packet  may  now  be  erased. 

(10)  Reroute  The  packet  is  rerouted  and  forwarded.  This  process  repeats 
until  succeeding.  Since  an  acknowledgment  has  already  been  sent  up¬ 
stream,  this  state  returns  to  the  w2ut-for-ack  state  (9). 

(11)  Wait  for  Token  The  node  erases  its  copy  of  the  packets  and  waits  for 
the  token  from  upstream. 

(12)  Send  Replica  This  state  is  entered  only  after  an  error  has  occurred, 
so  the  token  sent  downstream  must  be  of  type  replica. 

4.3  An  Example 

We  now  give  examples  of  the  protocol  moving  a  packet  through  a  simple 

network  topology,  as  shown  in  figure  4.3.  Node  A  has  generated  a  tagged 

packet  which  is  destined  for  node  B. 

The  tagged  packet  must  travel  via  either  switching  elements  SE1,SE2,SE4 

or  SE1,SE3,SE4.  We  will  assume  that  the  preferred  path  is  SE1,SE2,SE4. 

Here  is  the  protocol,  given  that  no  errors  occur. 

1.  The  protocol  begins  with  node  A  holding  a  tagged  unique  packet.  Node 
A  first  transmits  the  packet  part  of  the  tagged  packet  to  SEl. 

2.  SEl  sends  the  packet  part  to  SE2  while  keeping  a  copy. 

3.  When  SEl  determines  that  the  packet  part  has  been  successfully  trans¬ 
ferred  to  SE2,  SEl  sends  an  acknowledgment  to  Node  A. 

4.  When  node  A  receives  the  acknowledgment,  it  knows  that  the  packet 
is  now  safely  stored  in  two  places  in  the  network:  SEl  and  SE2.  Node 
A  then  deletes  its  copy  of  the  packet  and  sends  the  token  to  SEl. 
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Figure  4.3:  Simple  Network  Topology 


5.  Meanwhile,  SE2  has  transmitted  the  packet  to  SE4  and  sent  an  ac¬ 
knowledgment  to  SEl. 

6.  When  SEl  has  both  the  acknowledgment  and  the  token,  it  deletes  its 
copy  of  the  packet  and  sends  the  token  to  SE2. 

7.  SE4  sends  the  packet  to  node  B  and  sends  an  acknowledgment  to  SE2. 

8.  SE2  now  has  the  token  and  an  acknowledgment,  so  it  deletes  the  copy 
and  sends  the  token  to  SE4. 

9.  Node  B,  being  the  final  destination  of  the  tagged  packet,  sends  an 
acknowledgment  to  SE4  as  soon  as  it  receives  the  packet  portion. 

10.  SE4  deletes  its  copy  and  sends  the  token  to  node  B. 


Node  B  now  has  a  tagged  unique  packet  and  we  see  that  no  copies  exist  in 
either  node  A  or  any  of  the  switching  elements.  The  tagged  unique  packet 
has  been  delivered  exactly  once. 

We  now  need  to  examine  what  the  protocol  does  when  errors  occur.  In 
this  example,  the  fault  will  occur  in  the  channel  between  SEl  and  SE2. 
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1.  The  protocol  begins  with  node  A  holding  a  tagged  unique  packet.  Node 
A  first  transmits  the  packet  part  of  the  tagged  packet  to  SEl. 

2.  SEl  sends  the  packet  part  to  SE2  while  keeping  a  copy. 

3.  When  SEl  determines  that  the  packet  part  has  been  successfully  trans¬ 
ferred  to  SE2,  SEl  sends  an  acknowledgment  to  Node  A. 

4.  When  node  A  receives  the  acknowledgment,  it  knows  that  the  packet 
is  now  safely  stored  in  two  places  in  the  network:  SEl  and  SE2.  Node 
A  then  deletes  its  copy  of  the  packet  and  sends  the  token  to  SEl. 

5.  SE2  now  transmits  the  packet  to  SE4  and  tries  to  send  an  acknowledg¬ 
ment  to  SEl,  but  the  channel  between  SEl  and  SE2  goes  down.  SE2 
manufactures  a  replica  token  and  proceeds  as  if  the  token  was  received 
from  SEl. 

6.  SE4  sends  the  packet  to  node  B  and  sends  an  acknowledgment  to  SE2. 

7.  SE2  now  has  a  replica  token  and  an  acknowledgment,  so  it  deletes  the 
copy  and  sends  the  replica  token  to  SE4. 

8.  Node  B,  being  the  final  destination  of  the  tagged  packet,  sends  an 
acknowledgment  to  SEM  as  soon  as  it  receives  the  packet  portion. 

9.  SE4  deletes  its  copy  and  sends  the  replica  token  to  node  B. 

10.  Node  B  now  has  a  tagged  replica  packet.  Node  B  extracts  the  UID 
contained  in  the  packet  and  compares  it  against  the  list  of  replicas  it 
has  received  so  far.  Since  it  has  not  been  seen  before,  node  B  adds  it 
to  the  list  and  processes  the  packet. 

11.  Meanwhile,  SEl  notices  that  the  channel  between  SEl  and  SE2  has 
gone  down.  It  changes  its  token  from  unique  to  replica.  It  then  takes 
its  copy  of  the  packet  and  sends  it  to  SE3. 

12.  SE3  transmits  the  packet  to  SE4  and  sends  an  acknowledgment  to  SEl. 

13.  When  SEl  has  both  the  acknowledgment  and  the  token,  it  deletes  its 
copy  of  the  packet  and  sends  the  replica  token  to  SE3. 
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14.  SE4  sends  the  packet  to  node  B  and  sends  an  acknowledgment  to  SE3. 

15.  SE3  now  has  the  token  and  an  acknowledgment,  so  it  deletes  the  copy 
and  sends  the  token  to  SE4. 

16.  Node  B,  being  the  final  destination  of  the  tagged  packet,  sends  an 
acknowledgment  to  SE4  as  soon  as  it  receives  the  packet  portion. 

17.  SE4  deletes  its  copy  and  sends  the  replica  token  to  node  B.  Node  B 
now  has  a  tagged  replica  packet,  so  it  extracts  the  UID  and  checks  its 
list.  Since  the  UID  matches,  the  packet  is  discarded. 

We  see  that  although  node  B  has  received  two  copies,  it  processes  only 
one.  Thus,  for  this  simple  example,  the  protocol  works. 


4.4  Proof  of  Correctness 

The  example  given  shows  that  the  protocol  works  correctly  in  a  select  case. 
We  will  now  show  that  the  protocol  is  correct  in  general.  The  correctness 
properties  we  will  show  are: 

•  a  copy  of  the  packet  is  always  delivered, 

•  a  packet  is  cdways  delivered  with  exactly  one  token,  and 

•  if  a  packet  is  delivered  tagged  with  a  unique  token,  then  it  is  the  only 
only  copy  of  that  packet  delivered. 

We  begin  by  restating  our  assumptions  about  the  network.  We  require 
that  the  network  remain  fully  connected  after  all  faults  have  occurred,  to 
prevent  stranding  a  packet  in  the  network.  The  router  must  be  able  to  route 
around  any  faults.  The  faults  are  single  points  of  failure  and  do  not  occur  too 
closely  together  in  time.  Lastly,  we  require  that  progress  delivering  tokens 
or  packets  is  always  being  made. 

Theorem  1  If  a  a  node  s  sends  a  packet  to  destination  t,  a  copy  of  the 
packet  is  always  delivered. 
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Proof  The  unique  token  protocol  keeps  at  least  two  copies  of  any  packet 
in  the  network.  This  was  illustrated  in  figure  4.1,  where  the  packet  is  copied 
forward  before  the  rearmost  copy  is  erased.  The  fault  model  allows  the 
destruction  of  at  most  one  copy  per  fault.  After  the  fault  has  been  detected 
and  the  packet  started  on  a  new  path,  the  packet  is  again  duplicated.  Thus 
a  packet  will  always  be  delivered.  □ 

The  next  two  properties  are  difficult  to  prove,  as  the  protocol  relies  on 
implicit  channel  state  to  bind  a  token  and  a  packet.  We  will  make  that 
binding  explicit  by  augmenting  the  token  with  the  destination  of  the  packet 
and  the  packet’s  unique  identifier.  This  restriction  requires  that  a  node 
actually  have  a  copy  of  the  packet  before  it  is  able  to  generate  a  token.  This 
does  not  violate  the  spirit  of  the  protocol,  as  a  node  need  only  erase  its  copy 
of  the  packet  before  transmission  of  a  token  and  we  observe  that  the  protocol 
will  only  send  tokens  after  it  received  a  packet. 

The  second  property  requires  two  lemmas,  to  show  that  at  most  one  copy 
and  at  least  one  copy  are  delivered. 

Lemma  2  A  node  receives  at  most  one  copy  of  a  token  for  each  packet  it 
receives. 

Proof  In  order  for  a  token  to  be  delivered,  it  must  have  been  sent  from 
another  node.  In  order  for  the  token  to  be  sent,  the  protocol  must  have 
passed  either  the  forward  token  or  send  replica  states  shown  in  figure  4.2.  In 
order  to  reach  either  of  these  states,  a  copy  of  the  packet  must  have  been 
sent.  Once  the  token  is  sent,  the  packet  and  the  token  are  erased.  It  is  not 
possible  to  generate  additional  copies,  £is  the  unique  identifier  is  lost.  □ 

Lemma  3  A  node  receives  at  least  one  token  for  each  packet  it  receives. 

Proof  If  a  node  receives  a  packet  on  a  channel  and  communication  is 
not  subsequently  severed,  the  node  will  eventually  receive  a  token,  owing  to 
the  network’s  progress  property.  If  communication  is  severed  on  the  channel 
before  it  receives  a  token,  the  node  will  generate  one  new  token  and  bind  it 
to  the  packet,  as  shown  in  the  wait  for  token  and  send  replica  states  of  figure 
4.2.  □ 
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Theorem  4  A  packet  is  received  with  exactly  one  token. 


Proof  We  combine  Lemma  2  and  Lemma  3.  □ 

Lemma  S  If  a  unique  token  is  transmitted  from  a  node,  only  one  copy  of  a 
packet  was  transmitted  from  that  node. 

Proof  Tokens  are  only  transmitted  in  states  forward  token  and  send 
replica.  In  order  for  a  unique  token  to  be  transmitted,  the  protocol  must 
have  passed  through  the  forward  token  state.  There  is  only  one  set  of  state 
transitions  which  allow  the  protocol  to  pass  through  that  state.  On  that  set 
of  transitions,  the  packet  is  transmitted  exactly  once  in  state  forward  packet. 
□ 


Theorem  6  If  a  packet  tagged  with  a  unique  token  is  delivered  to  a  destina¬ 
tion,  then  it  is  the  only  copy  of  that  packet  delivered. 

Proof  By  contradiction.  Suppose  that  multiple  Copies  of  a  packet  were 
delivered  to  the  destination,  along  with  a  unique  token  on  one  of  the  packets. 
In  order  for  multiple  copies  of  packet  to  be  delivered,  a  node  between  the 
source  and  destination  must  have  transmitted  multiple  copies.  At  the  node 
where  multiple  packets  were  transmitted,  none  of  the  transmitted  packets 
were  tagged  unique.  As  those  packets  found  their  way  to  the  destination, 
none  of  the  tokens  could  have  been  converted  to  type  unique,  so  no  unique 
tokens  could  have  been  delivered.  □ 


4.5  Performance  and  Scalability 

Under  the  unique  token  protocol,  assuming  error-free  operation,  each  node 
transmits  a  packet,  an  acknowledgment,  and  a  token.  The  token  is  quite  small 
and  can  be  represented  by  a  single  bit.  The  acknowledgment  is  only  used 
between  nodes  and  could  be  represented  in  a  few  bits.  Thus,  the  bandwidth 
between  nodes  is  used  primarily  for  moving  the  packet.  The  end-to-end 
protocol  required  that  the  acknowledgment  be  sent  end-to-end,  consuming 
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bandwidth  equal  to  that  of  a  small  packet.  Therefore,  in  terms  of  wire 
bandwidth,  the  unique  token  protocol  is  superior. 

The  unique  token  protocol  was  designed  to  eliminate  the  nonlinear  growth 
in  storage  of  end-to-end  protocols.  There  are  two  storage  components  of 
concern:  the  storage  in  the  source  and  the  duplicate  elimination  tables.  The 
unique  token  protocol  eliminates  the  storage  in  the  source.  Instead,  it  uses 
storage  in  the  network.  From  the  discussion  of  packet  replication,  we  see 
that  the  network  requires  buffers  for  at  most  three  copies  of  any  packet.  This 
compares  well  to  the  two  copy  requir«nent  for  ordinary  store  and  forward 
(copy  forward,  then  erase).  When  wormhole  routing  is  used,  the  buffering 
requirements  approach  two  copies  for  the  unique  token  protocol  and  one  copy 
for  the  end-to-end  protocols.  The  buffering  required  is  a  fixed  amount  per 
switching  node,  which  implies  that  the  amount  of  memory  in  the  system 
dedicated  to  packet  replication  grows  linearly  with  the  size  of  the  machine. 

Duplicate  elimination  must  still  occasionally  be  performed  at  the  desti¬ 
nation,  requiring  tables.  However,  the  tables  need  only  log  packets  whose 
tokens  are  of  type  replica,  not  those  tagged  as  unique.  Under  error-free  con¬ 
ditions,  the  log  will  be  empty.  When  an  error  occurs,  only  a  fixed  number  of 
replicas  can  be  created  before  the  network  heals  and  packets  are  adaptively 
routed  around  the  fault.  Restated,  faults  only  cause  replica  of  packets  to  be 
made  when  those  packets  were  near  the  fault  around  the  time  of  failure,  not 
as  a  result  of  a  static  zulaptation.  If  faults  occur  one  at  a  time,  only  a  fixed 
number  of  replicas  can  ever  be  sent  to  a  destination,  allowing  the  duplicate 
elimination  log  to  be  of  fixed  size.  Again,  a  fixed  size  per  destination  node 
allows  a  linear  growth  in  the  memory  requirements  of  the  system. 

In  terms  of  latency,  the  protocol  does  not  delay  the  arrival  of  the  packet 
at  the  destination,  even  if  the  token  is  slowed.  One  simply  winds  up  with 
an  excess  number  of  copies  of  the  packet  distributed  along  the  path  between 
the  token  and  the  destination.  However,  since  the  packet  arrives  at  the 
destination  while  the  token  is  at  least  two  nodes  away,  there  is  approximately 
a  two  hop  delay  before  the  destination  knows  the  type  of  token. 

This  need  not  be  an  impediment  to  the  start  of  processing  the  packet. 
If  one  is  able  to  bound  the  maximum  packet  latency  in  the  system,  one  is 
able  to  timestamp  the  entries  in  the  duplicate  elimination  log  and  discard 
them  after  all  possible  copies  have  been  received  owing  to  the  time  constraint. 
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Since  we  expect  the  maximum  delivery  time  to  be  small,  say  on  the  order 
of  milliseconds,  and  the  interarrival  time  of  failures  to  be  large,  say  on  the 
order  of  tens  of  seconds,  most  of  the  time  the  log  will  be  empty.  When  the 
log  is  empty,  we  can  safely  process  any  packet  which  arrives,  as  long  as  the 
tokens  for  all  previous  packets  have  been  seen. 

Of  course,  if  several  packets  arrive  before  their  tokens,  due  to  the  time 
multiplexing  of  several  packets  onto  the  same  physical  channel,  one  is  always 
able  to  do  a  small  amount  of  ordinary  duplicate  elimination  as  in  the  end-to- 
end  protocols.  The  size  of  the  table  required  for  this  corresponds  exactly  to 
the  number  of  packets  time-multiplexed  onto  the  physical  channel.  Again, 
these  are  all  fixed  size  tables,  so  the  overall  storage  requirement  grows  linearly 
with  the  size  of  the  system. 


4.6  Summary 

In  this  chapter,  we  described  a  protocol  which  will  reliably  deliver  messages 
without  the  use  of  an  end-to-end  protocol.  The  protocol’s  correctness  was 
demonstrated  in  an  example  and  then  proven  correct.  The  new  protocol 
only  requires  constant  amounts  of  storage  in  each  node  to  allow  replication 
of  packets.  Thus,  the  protocol  imposes  no  limits  on  machine  scaling  owing 
to  memory  constraints.  The  protocol  was  shown  to  be  more  wire  bandwidth 
efficient  than  end-to-end  protocols.  It  was  also  shown  not  to  increase  the 
latency  from  the  start  of  message  transmission  to  the  start  of  message  pro¬ 
cessing.  On  all  three  critical  measures:  memory,  bandwidth,  and  latency,  the 
unique  token  protocol  is  superior  or  equal  to  end-to-end  protocols. 
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Chapter  5 


Switching  Element 
Organization 


In  the  preceding  chapter  we  presented  the  unique  token  protocol  which  is  to 
be  implemented  in  the  network.  We  need  to  understand  the  costs  of  such 
an  implementation.  In  developing  the  organization  of  the  switching  element, 
we  will  gauge  its  complexity  and  look  for  unforeseen  costs  or  benefits.  We 
choose  2D  meshes  as  the  base  network,  as  they  afford  simple  non-reliable 
implementations  and  the  complexity  of  a  reliable  version  will  be  due  mainly 
to  the  reliability  features.  Sinularly,  we  organize  the  parts  of  the  design  for 
clarity,  not  for  performance,  as  we  want  the  design  to  be  intuitively  correct. 
This  chapter  will  describe  how  a  reliable  switching  element  intended  for  use 
in  2D-mesh  networks  may  be  organized. 


5.1  Information  Exchange 

The  switching  element  has  interfaces  to  its  four  geographic  neighbors  and  its 
processor,  as  shown  in  figure  5.1.  Here,  the  same  type  of  interface  is  used  to 
connect  the  switching  element  to  the  processor  as  is  used  to  connect  to  the 
neighboring  switching  elements,  although  it  need  not  be  the  case  in  other 
designs.  To  form  a  network,  the  switching  elements  are  tiled  into  a  plane 
and  are  assigned  (x,  y)  coordinates. 
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North 


South  Processor 

Figure  5.1:  Switching  element  for  2D  mesh 

To  describe  how  these  switching  elements  communicate,  we  will  use  the 
terminology  suggested  by  Dally  [7].  Communication  between  processing  ele¬ 
ments  is  performed  by  sending  messages  through  the  network.  Messages  may 
be  arbitrarily  long,  so  a  message  may  have  to  broken  into  one  or  more  packets 
for  transmission.  Packets  contain  sequencing  information  to  allow  them  to 
be  reassembled  into  a  message.  They  also  contain  routing  information  and 
are  the  smallest  unit  of  information  exchange  to  do  so.  The  routing  informa¬ 
tion  is  specified  as  a  destination  (x,  y )  coordinate  pair.  Packets  must  carry  a 
unique  identifier  for  possible  duplicate  elimination  at  the  destination.  This 
unique  identifier  is  simply  the  source's  (x,y)  coordinate  pair  concatenated 
with  a  small  sequence  number. 

A  packet  contains  three  or  more  flow  control  digits  or  flits.  They  are  the 
smallest  unit  on  which  flow  control  is  performed.  By  decomposing  the  packets 
into  flits,  the  packet  may  be  wormhole  routed  over  virtual  channels.  Flits 
from  multiple  packets  are  time  division  multiplexed  onto  a  single  physical 
channel,  forming  the  virtual  channels.  The  virtual  channeb  associated  with 
a  single  physical  channel  share  physical  bandwidth,  allocated  on  a  flit  by 
flit  basis.  Since  the  flits  must  be  demultiplexed  when  received,  they  carry  a 
virtual  channel  identifier. 

Flits  come  in  three  main  kinds;  head,  data,  and  token.  A  packet  is 
made  of  one  head  flit,  one  or  more  data  flits,  and  one  token.  The  head 
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flit  contains  the  routing  information.  The  first  data  flit  carries  the  unique 
identifier.  Subsequent  data  flits  are  used  for  the  actual  data  transport.  The 
token  flit  cannot  carry  any  data  and  comes  in  two  sub-kinds:  unique  and 
replica.  Tokens  of  sub-kind  replica  are  generated  during  network  failures. 

Flits  carry  sequence  numbers  to  allow  their  reassembly  into  a  packet  in 
the  event  that  that  network  failure  cuts  the  packet  in  two.  This  reassembly 
information  could  be  maintained  as  a  starting-flit-number  field  in  the  head 
flit  of  the  packet.  It  is  not  done  here  in  order  to  simplify  the  design. 

Flits  are  transferred  over  physical  channels  in  physical  transfer  units  or 
phits.  A  phit  is  usually  smaller  than  a  flit  as  one  encounters  pin-count  re¬ 
strictions  on  actual  IC  devices.  For  example,  36  data  bits  over  7  bidirectional 
channels  would  require  504  signalling  pins.  In  this  particular  case,  we  wanted 
to  keep  the  total  pin-count  for  the  switching  element  in  the  neighborhood 
of  100  pins.  With  5  ports,  this  came  out  to  be  under  10  pins  per  phit,  so  a 
width  of  8  bits  was  chosen  for  a  phit.  One  of  the  8  bits  is  used  for  parity,  so 
7  remain  for  the  movement  of  data  between  devices. 

At  least  6  phits  are  needed  to  transport  a  data  field  of  36  bits.  After 
counting  11  additional  bits  for  virtual  channel  numbers,  sequence  numbers, 
and  kind  fields,  we  see  that  7  phits  are  nearly  fully  used  in  transporting 
a  flit.  Other  information,  such  as  acknowledgments  and  alarms  must  be 
communicated  between  devices.  To  make  efficient  use  of  the  wires  between 
devices,  we  use  the  same  wires  as  are  used  to  in  transporting  the  flit  and  time 
multiplex  onto  them  this  additional  information.  The  time  multiplexing  is 
done  in  a  fixed  manner  and  we  call  the  structure  of  the  time  multiplexing  a 
frame.  Figure  5.2  shows  the  complete  frame  format. 


5.2  Block  Diagram 

The  switching  element,  as  shown  in  figure  5.3,  is  composed  of  four  types 
of  elements.  They  are  input  controllers,  a  switch,  output  controllers,  and  a 
router.  Before  delving  into  the  details  of  each  module,  a  few  top  level  items 
need  to  be  explained. 

The  first  item  is  where  to  put  the  the  buffers.  Since  the  switching  element 
does  not  drop  flits  under  contention,  buffering  for  flits  must  be  provided.  In 
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Figure  5.3:  Switching  element  block  diagram 


1 1 1  I  I  III 


this  organization,  the  buffering  was  chosea  to  be  solely  in  the  input  controller. 
This  was  done  for  several  reasons.  The  first  is  that  the  presence  of  output 
queueing  implies  that  the  peak  transfer  rate  out  of  the  switch  is  higher  than 
the  peak  transfer  rate  of  the  physical  channel,  which  requires  wider  internal 
data  paths  or  higher  internal  clock  rate.  Both  of  these  seem  unlikely  in 
practical  applications.  The  second  reason  has  to  do  with  the  generation  of 
acknowledgments.  The  switching  element  cannot  send  an  acknowledgment 
upstream  for  a  flit  until  that  flit  has  successfully  been  copied  downstream. 
If  queueing  is  done  in  the  output  controller,  multiple  output  controllers  may 
decide  to  send  acknowledgments  over  the  same  physical  channel  at  the  same 
time,  contending  for  use  of  the  acknowledgment  signals.  Solutions  involve 
either  separate  acknowledgment  signals  for  each  virtual  channel  or  some  form 
of  queueing  on  acknowledgments,  both  of  which  appear  to  be  impractical. 

The  second  item  deserving  explanation  at  this  point  is  the  use  of  a  single 
centralized  router.  The  motivation  behind  this  is  that  the  router  allocates 
resources,  the  output  virtual  channels.  If  the  router  were  to  be  distributed 
among  the  input  controllers,  some  form  of  arbitration  for  those  resources 
would  have  be  implemented.  Although  distributing  the  router  is  desirable 
to  avoid  queueing  delay  for  the  centralized  router,  the  arbitration  for  virtual 
channels  and  communication  of  deadlock  avoidance  information  is  complex. 
Attempting  to  describe  such  a  distributed  router  is  deemed  outside  the  scope 
of  this  thesis. 

The  last  item  is  that  the  entire  switching  element  is  flit  synchronous. 
This  was  done  to  avoid  the  possibility  of  livelock  arising  from  having  control 
loops  longer  than  the  fundamental  unit  of  message  synchronization.  If  the 
switching  element  were  to  operate  phit-synchronously,  one  would  have  to  go 
through  all  control  loops  and  guarantee  that  livelock  could  not  arise.  It  is 
far  simpler  to  make  the  unit  of  synchronization  longer  than  any  control  loop 
and  suffer  a  performance  loss. 

5.2.1  Input  Controller  Overview 

The  input  controller,  shown  in  figure  5.4,  logically  consists  of  four  submod¬ 
ules:  a  loader,  a  dual-ported  RAM,  a  pointer  file,  and  an  unloader.  The 
loader  places  flits  into  the  dual-ported  RAM.  The  unloader  is  responsible  for 


33 


§ 


I 


u 


Figure  5.4:  Input  controller  block  diagram 


maint2dning  all  channel  state,  interacting  with  the  router  and  output  con¬ 
trollers,  and  removing  flits  from  the  RAM.  The  pointer  file  keeps  all  the 
read/write  pointers  into  the  RAM. 

The  loader  contains  modules  to  mceive  phits  from  the  physical  channel, 
check  the  parity  and  CRC  fields,  and  place  flits  into  the  dual  ported  RAM. 

The  loader  communicates  the  virtual  channel  state  information  to  the 
unloader  via  a  pointer  file.  The  pointer  file  maintfuns  both  write  and  read 
pointers  into  the  RAM  for  all  virtual  channeb.  The  pointer  file  also  keeps  a 
count  of  the  number  of  acknowledgments  sent  for  each  virtual  channel. 

The  unloader  assigns  one  of  three  states  to  each  input  virtual  channel: 
idle,  ivaitingFor Route,  or  active.  The  unloader  monitors  the  pointer  file, 
looking  for  channels  with  unsent  flits.  This  is  determined  by  a  disparity  in 
read  and  write  pointers.  If  the  channel  is  idle,  the  first  flit  is  sent  to  the 
router  and  the  channel  state  changed  to  waitingForRoute.  If  the  channel  is 
active,  the  flit  is  forwarded  to  correct  output  port.  If  the  channel  is  in  the 
waitingForRoute  state,  nothing  is  done. 

When  the  router  establishes  a  route,  it  changes  the  state  of  the  virtual 
channel  to  active.  If  the  route  must  be  retried,  the  state  is  changed  to  idle. 

The  unloader  contains  the  logic  for  generating  acknowledgments.  An 
acknowledgment  must  be  delayed  a  few  flit  times  after  the  flit  enters  the 
crossbar,  to  allow  the  error  checking  on  the  output  link  to  take  place. 

The  unloader  also  contains  logic  for  changing  flits  of  type  tokenUnique 
to  type  tokenReplica,  as  well  as  being  able  to  generate  a  flit  of  type  token- 
Replica. 
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Figure  5.5:  A  <•  oai.  Input  1  is  connected  to  output  2,  2  to  1,  3  to  4,  4  to 
3,  and  5  to  6. 

5.2.2  Crossbar  Overview 

The  crossbar  moves  flits  from  input  controllers  to  output  controllers  and 
the  router,  which  results  in  an  asymmetric  crossbar.  The  input  controller 
submits  a  bid  to  the  crossbar,  indicating  to  which  output  it  wishes  to  send 
a  flit.  The  crossbar  responds  with  a  bid-accepted  signzd  if  the  crosspoint  is 
established.  The  output  controller  is  told  of  the  crosspoint  by  the  column 
valid  signal.  A  typical  crossbar  is  shown  in  figure  5.5. 

The  output  controller  or  router  indicates  acceptance  of  a  flit  by  assert¬ 
ing  the  destination-accepts  signal.  The  output  controller  may  refuse  a  flit 
if  transmission  of  the  flit  would  overflow  the  downstream  input  buffer.  In 
addition,  the  crossbar  provides  an  indicator  of  output  channel  health  back 
to  the  input  port. 

5.2.3  Output  Controller  Overview 

The  output  controller  is  composed  of  two  submodules,  the  output  filter  and 
the  output  packager.  The  output  filter  maps  input  <port,virtual  channel> 
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tuples  to  output  virtual  channels.  The  virtual  channel  field  in  the  flit  is 
altered  in  this  submodule.  The  output  filter  also  implements  virtual  channel 
flow  control  by  not  accepting  a  flit  from  the  crossbar  when  sending  that  flit 
would  cause  a  bufler  to  overflow. 

The  output  packager  computes  and  affixes  parity  for  the  first  and  last 
phits  of  a  flit.  Parity  for  the  intermediate  phits  are  end-to-end.  In  addition, 
the  output  packager  computes  and  affixes  the  CRC  field. 

5.2.4  Router  Overview 

The  router  is  responsible  for  the  mapping  of  input  <port, virtual  channel> 
tuples  to  output  ports.  The  router  receives  head  flits  from  the  crossbar. 
The  router  disassembles  the  flit  into  a  dimension  reversal  (DR)  number  and 
destination  x,  y,  and  z  coordinates.  The  DR  number  is  broadcast  to  all  output 
controllers.  Each  output  controller  responds  with  three  pieces  of  information: 

•  if  the  output  port  is  up, 

•  if  there  are  any  free  virtual  channels,  and 

•  if  there  are  no  free  channels,  whether  or  not  the  packet  can  wait  for  a 
virtual  channel  to  free  up. 

The  router  then  attempts  to  perform  dimension  ordered  routing.  If  the 
desired  output  controller  is  down,  an  output  controller  is  selected  at  random. 
If  the  targeted  output  has  no  free  channels  and  the  pacl»t  cannot  wait  due 
to  DR  number  constraints,  a  new  output  controller  is  picked. 

If  the  targeted  output  controller  has  no  free  virtual  channel,  the  router 
tells  the  input  controller  to  retry  the  route  request.  If  there  is  a  free  virtual 
channel,  the  router  signals  the  output  controller  to  reserve  a  virtual  channel 
at  a  given  DR  level.  The  router  also  notifies  the  input  controller  that  a 
route  has  been  established  and  that  the  input  controller  may  proceed  with 
forwarding  the  packet. 
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5.3  Summary 


In  this  chapter,  we  presented  the  top-level  organization  of  a  switching  ele¬ 
ment  implementing  the  unique  token  protocol.  The  communication  between 
switching  elements  was  described.  A  block  diagram  of  the  switching  element 
was  explained  and  several  architectural  motivations  were  given.  Details  of 
the  four  top-level  blocks  were  then  filled  in.  We  noted  that  this  particular 
organization  was  chosen  to  minimize  the  design  complexity  of  the  microar¬ 
chitecture,  which  is  the  subject  of  the  next  chapter. 
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Chapter  6 


Deteiils  of  the  Switching 
Element  Design 


The  low-level  architecture  and  design  must  look  ahead  to  the  physical  imple¬ 
mentation.  The  easiest  implementations  (for  humans)  using  our  CAD  tools 
are  done  using  standard  ceils.  This  implies  that  the  logic  complexity  cannot 
be  too  great,  nor  can  more  than  a  few  specialized  structures  be  built.  Stan¬ 
dard  cell  designs  are  also  difficult  to  optimize  for  speed,  so  we  do  not  attempt 
to  push  the  library  cells  to  their  limits.  We  also  restrict  ourselves  to  a  clock 
methodology  and  a  sequential  logic  style  which  may  be  readily  verified  using 
a  timing  analyzer  such  as  Crystal. 

Design  guidelines  are  specified  to  support  the  design  goals.  The  guidelines 
cover  the  clocking  methodology  and  how  hand  timing  calculations  may  be 
done.  They  discuss  documentation  style,  the  use  of  custom  circuits  and 
circuit  conventions. 

A  detailed  architecture  for  the  switching  element  is  presented.  It  is  given 
at  the  module  level,  which  should  correspond  to  a  page  or  two  of  schematics 
per  module.  For  each  module,  we  list  its  inputs  and  outputs  and  give  a 
description  of  the  functionality  contained  in  the  module.  The  signal  names 
used  in  the  descriptions  are  intended  to  be  the  same  as  those  used  in  the 
M-language^  model  of  this  architecture,  found  in  the  appendix. 

*  Mentor  Graphics,  Silicon  Design  Division 
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6.1  Design  Goals  and  Guidelines 


The  primary  design  goal  is  to  demonstrate  the  logical  correctness  of  the 
unique  token  protocol  under  operating  conditions.  The  protocol  is  built 
on  top  of  a  full-featured  router,  one  that  is  both  adaptive  and  provably 
deadlock  free.  This  design  base  is  non-trivial  and  we  try  to  simplify  the  design 
wherever  possible.  We  therefore  minimize  the  number  of  basic  mechanisms 
and  do  not  perform  some  obvious  optimizations.  For  example,  portions  of 
the  router  could  be  distributed  among  the  input  controllers,  optimized  for 
packets  traveling  in  straight  line.  This  is  a  common  case  and  would  eliminate 
congestion  at  the  router  module.  However,  it  would  add  considerable  logic 
complexity  and  chip  area,  so  it  is  deemed  to  be  not  worth  the  risk. 


6.1.1  Two-Phase  Clocks 

The  chip  is  synchronous  and  derives  all  timing  from  a  two- phase  non-overlapping 
clock.  An  single  external  clock  is  supplied  and  the  clock  buffer  logic  will  in¬ 
ternally  generate  the  two  non-overlapping  phases  PHI  and  PH2.  The  two 
high-true  phases  PHI  and  PH2  are  distributed  across  the  entire  switching 
element.  The  low-true  versions  PH  In  and  PH2n  must  be  derived  locally  in 
each  module. 

For  the  MOSIS  2fi  process,  r  is  about  75ps.  Our  goal  is  a  clock  period  of 
600t,  200r  per  phase  with  lOOr  of  non-overlap.  Figure  6.1  illustrates  where 
the  nonoverlap  time  of  lOOrgoes.  Intermodule  clock  skew  is  +/—  25t.  For 
SPICE,  we  use  slow-N,  slow-P,  at  50C  and  4.5v. 

Strict  two-phase  discipline  is  used  throughout  the  chip.  All  signals  are 
labeled  with  their  type,  one  of: 
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Figure  6.1:  Where  the  nonoverlap  time  goes 


Valid  during  phase  1.  Si  signab  are  the  outputs  of  latches 
clocked  by  PH2  or  signals  combinatorially  derived  from  other 
VI  signals. 


Valid  during  PH2. 


Qualified  by  PHI.  Ql  signals  are  asserted  only  during  PHI. 
PHI  is  a  Ql  signal.  Other  Ql  signals  are  derived  by  ANDing 
PHI  with  VI  signals. _ 


Q2  I  Qualified  by  PH2. 


SI  Setup  during  PHI.  Occasionally,  a  signal  originating  at  a  PH2 
latch  will  have  a  propagation  time  longer  than  1/2  a  clock 
period.  Such  signals  will  met  the  setup  time  for  a  PHI  latch, 
but  cannot  be  used  to  generate  a  Ql  signal.  Note  that  all  VI 
signals  are  Si  signals,  so  a  VI  signal  may  safely  be  supplied  to 
any  input  requiring  an  Si  signal.  _  _ 


Setup  during  PH2. 


The  remaining  restrictions  are: 


1.  Maximum  delay  to  a  Ql  (Q2)  signal  from  PHI  (PH2)  is  50r.  Delay  is 
measured  from  50%  of  input  to  50%  of  output. 
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2.  Maximum  rise(faU)  time  (10%  to  90%)  of  a  Q  signal  is  25r. 

3.  There  are  no  minimum  delay  constriunts. 

4.  All  combinatorial  delays  must  be  less  than  250r  from  PHI  or  PH2 
except  where  expressly  noted.  This  is  to  allow  a  very  simple  timing 
verifier  to  operate. 

5.  Each  latch  must  be  supplied  with  a  Ql  clock  and  Si  data  or  a  Q2  clock 
and  S2  data. 

6.  All  feedback  paths  must  be  broken  by  a  latch. 

7.  All  signals  sourced  by  a  module  should  be  V2. 

8.  All  internal  modules  should  latch  their  inputs  on  PH2,  operate  for  zero 
or  more  cycles,  and  then  drive  their  outputs  starting  in  PHI. 

9.  Modules  should  assume  that  their  inputs  are  S2,  as  they  may  still  be 
changing  during  PH2,  on  account  of  clock  skew  between  modules.  They 
will  be  stable  IOOt  before  the  end  of  PH2. 

10.  All  modules  used  should  use  only  fully  restored  static  CMOS  logic. 

1 1 .  All  signals  should  be  pulled  to  VDD  or  GND  at  all  times.  No  precharged 
or  pseudo- NMOS  circuits  should  be  employed.  All  transmission  gates 
should  include  both  N  and  P  devices. 

12.  All  modules  should  be  designed  so  that  all  storage  nodes  are  static 
during  PHI.  If  the  clock  is  stopped  with  PHI  high,  no  data  should  be 
lost. 

13.  All  static  latch  feedback  loops  should  connect  only  to  the  gates  of 
transistors  to  avoid  dynamic  charge  sharing. 

6.2  Top  Level  Microarchitecture 

With  the  design  guidelines  stated,  the  details  of  the  microarchitecture  can 
be  presented,  starting  at  the  top  level.  The  block  diagram,  as  presented  in 
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chapter  4,  did  not  provide  any  details  of  the  wiring  between  modules.  It 
is  redrawn  in  figure  6.2  to  explicitly  call  attention  to  the  module  intercon¬ 
nect.  A  representative  of  each  modules  is  shown  is  figure  6.3,  along  with  the 
intermodule  signals  used. 


The  chip  interface  consists  of  support  signab,  such  as  clock  and  reset, 
together  with  the  signals  used  to  communicate  with  other  switching  elements. 


Name 

I/O 

Description 

clock 

I 

The  single  phase  clock. 

reset 

I 

The  chip  reset. 

I 

The  port  data  in. 

lanmitgiii 

0 

The  port  data  out. 

6.3  Timing  Generation 

The  timing  generator  tim.tim  provides  a  chip  wide  timebase  for  flit  relative 
timing.  Since  there  are  8  phits  per  flit,  the  timing  module  provides  8  outputs, 
phitTime(7..0].  All  are  V2  signals.  PhitTimefO]  corresponds  to  the  time  when 
the  first  phit  of  a  flit  appears  on  the  physical  channel  between  IC’s. 
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Figure  6.2:  Top  Level  Module  Block  Interconnect 
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Figure  6.3;  Top  Level  Inter-Module  SignaJling 
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6.4  Input  controller  -  ict  Jet 

Module  I/O 


Name 

I/O 

Description 

d[7..0] 

I 

One  port’s  worth  of  physical  input  wires  to 
the  chip. 

"qfO 

0 

Flit  data  sent  to  the  crossbar 

bid[2..0) 

0 

The  output  controller  or  router  to  send  the 
flit  to. 

bid^cepted 

I 

The  crossbar  accepted  the  bid,  i.e.  a  connec¬ 
tion  has  made  between  the  input  controller 
and  the  desired  output  port. 

dst  accepts 

I 

The  connected  output  controller  accepted 
the  flit  sent. 

dst-alarm 

I 

The  connected  output  controller  is  down. 

alarm 

0 

The  input  controller  has  detected  some  form 
of  error,  possibly  parity,  CRC,  alarm  re¬ 
ceived,  or  other  protocol  violation. 

alarms[4..0] 

I 

The  alarms  from  all  input  controllers. 

j:!  j  !1 

0 

The  last  acknowledgment  received. 

0 

The  acknowledgment  that  the  paired  output 
controller  should  send. 

setjoute 

I 

A  control  input  from  the  router,  it  is  used  to 
indicate  that  a  route  has  been  established. 

retry  joute 

I 

A  control  input  from  the  router,  it  is  used 
to  indicate  that  a  route  has  not  been  estab¬ 
lished  and  should  be  rerequested  later. 

vc_being  j'outed[2.  .0] 

I 

The  virtual  channel  that  the  router  is  cur¬ 
rently  updating. 

route^ort[2..0] 

I 

When  a  route  is  established,  this  is  the  target 
exit  port. 
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Description 

The  input  controller  is  composed  of  only  of  submodules,  no  logic  is  imple¬ 
mented  at  this  level.  The  block  levri  interconnect  is  shown  in  figure  6.4.  The 
submodules  include: 

input  port,  which  captures  incoming  data, 
checker,  which  checks  incoming  data, 

low  address  generation,  which  provides  a  phit  offset  into  the  flit  storage, 
dual  ported  RAM,  where  the  flits  are  stored, 
address  map,  which  converts  a  flit  sequence  number  into  a  pointer, 
pointer  file,  where  the  FIFO  pointers  for  each  virtual  channel  are  kept, 
load  control,  which  controls  the  entry  of  flits  into  the  FIFOs,  and 
unloader,  which  removes  flits  from  the  FIFOs  and  maintains  channel  state. 


6.4.1  Input  Port  >  ipJp 


Module  I/O 


Name 

I/O 

Description 

I 

One  port’s  worth  of  physical  input  wires  to  the  chip. 

0 

Flit  data  sent  to  the  dual  ported  RAM 

raw_q(7..0] 

0 

Flit  data  sent  to  the  flit  checker 

0 

The  acknowledgment  just  received,  it  is  sent  to  the 
paired  output  controller. 

Description 

The  input  port  latches  the  external  data  using  phi.  This  “raw”  data  is  sent 
to  the  checker  for  verification  of  parity,  CRC,  and  alarm  fields.  The  data  is 
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Figure  6.4:  Input  Controller  Block  Diagram 
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then  delayed  a  couple  of  phits  and  sent  to  the  dual  ported  RAM.  The  delay 
is  necessary  to  allow  the  pointer  file  to  retrieve  the  correct  write  pointer.  It 
also  allows  the  checker  to  suppress  the  increment  of  the  write  pointer,  thus 
preventing  errored  flits  from  affecting  the  unloader.  Note  that  an  errored  flit 
will  be  written  into  the  dual  ported  RAM  using  a  possibly  errored  VC,  so  an 
extra  flit’s  worth  of  storage  must  be  added  to  each  input  buffer  to  account 
for  this  overwriting. 

6.4.2  Flit  Checker  -  ict.check 

Module  I/O 


Name 

I/O 

Description 

I 

The  raw  phit  stream  from  the  input  port. 

1  alarm 

0 

Alarm  is  asserted  when  any  error  is  detected. 

Description 

The  checker  checks  the  parity  every  clock  period.  It  also  monitors  the  alarm 
bit  sent  with  each  flit.  If  either  condition  ever  becomes  true,  the  alarm  is  set. 
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6.4.3  Pointer  File  -  pf-pf 

Module  I/O 


Name 

I/O 

Description 

alarm 

I 

The  alarm  signal  from  this  input  controller’s 
checker. 

load-vc[2..0 

1 

The  virtual  channel  of  the  incoming  flit 

I 

The  virtual  channel  for  the  outgoing  flit 

ack_vc[2..0] 

I 

The  virtual  channel  of  the  flit  just 
acknowledged 

incr^ck 

I 

Control  signal  which  causes  the  acknowledg¬ 
ment  count  for  a  virtual  channel  to  increment. 

loadjopcode[2..0] 

I 

Operation  desired  by  the  loader. 

unload  jopcode[2..0] 

I 

Operation  desired  by  the  unloader. 

read^tr[5..0] 

0 

The  sequence  number  of  the  flit  to  be  read 
from  the  dual  ported  RAM. 

write4>ointer[5..0] 

0 

The  sequence  number  of  the  flit  to  written  to 
the  dual  ported  RAM. 

flits.to  jsend  [4 .  .0] 

0 

Indicates  which  virtual  channels  are  non¬ 
empty. 

send.ack[4..0] 

0 

Indicates  which  virtual  channels  have  for¬ 
warded  flits  and  not  yet  acknowledged  them. 

Description 

The  pointer  file  maintains  4  counters  for  each  input  virtual  channel:  a  write 
pointer,  an  acknowledgment  count,  a  head-of- packet  read  pointer,  and  a  data 
read  pointer.  It  also  outputs  two  signals  per  virtual  channel:  flitsJ;o_send 
and  send_ack. 

Two  read  pointers,  not  one,  are  needed  to  handle  retransmissions  for  the 
tail  ends  of  long  packets.  When  a  packet  needs  to  be  retransmitted,  the  head 
flit  must  again  be  sent  to  the  router.  After  the  head  flit  is  transmitted,  the 
effective  read  pointer  might  need  change  to  point  at  any  location  in  the  input 
ring.  The  two  read  pointers  are  combined  together  to  form  an  effective  read 
pointer  in  the  following  way: 
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•  if  the  head-of-packet  read  pointer  is  less  than  the  number  of  head  flits, 
the  effective  read  pointer  is  the  head-of-packet  read  pointer. 

•  if  the  head-of-packet  pointer  is  equal  to  the  number  of  head  flits,  the 
effective  read  pointer  is  the  sum  of  the  head-of-packet  pointer  and  the 
data  read  pointer. 

When  the  read  pointer  is  to  be  incronented,  the  head-of-packet  pointer  is 
incremented  until  it  reaches  the  number  of  head  flits,  then  the  data  read 
p>ointer  is  incremented. 

The  write  pointer  is  used  by  the  loader  to  place  flits  into  the  dual  ported 
RAM.  When  the  write  pointer  is  greater  than  the  effective  read  pointer,  the 
signal  flits-to^end  is  asserted. 

The  acknowledgment  counter  is  not  externally  visible.  When  the  effective 
read  pointer  is  greater  than  the  acknowledgment  count,  the  signal  send.ack 
is  asserted. 

Backing  up 

This  is  a  bit  tricky.  Suppose  that  the  data  portion  of  the  input  buffer  is 
n  flits  long  and  the  write  pointer  is  w.  We  know  that  one  flit  of  the  n 
may  be  inadvertently  overwritten  when  input  channel  errors  occur.  We  also 
know  that  some  number  of  flits  may  sent  prior  to  an  acknowledgment  b«ng 
sent.  Call  this  threshold  number  t.  So,  the  t  +  1  flits  after  the  current  write 
pointer  may  be  overwritten.  We  can  therefore  only  back  the  read  pointer  to 
be  tt;  —  (n  —  (t  -t- 1))  =  tu  —  n  -I-  <  -1- 1  flits. 


6.4.4  Pointer  File  Load  Controller  -  pf_ctl 

Module  I/O 


Name 

I/O 

Description 

opcode[2..0] 

0 

The  opcode  to  the  pointer  file  for  flit  loading. 
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Description 


The  pointer  file  load  controller  initiates  a  fetch  of  the  write  pointer  for  each 
incoming  flit  and  increments  the  write  pointer. 

6.4.5  Dual  Ported  Ram  •  dpr-dpr 

Module  I/O 


Name 

I/O 

Description 

EIS^ 

I 

Data  in 

ql7..0] 

0 

Data  out 

I 

Read  address 

wa  9..0 

I 

Write  address 

we 

I 

Write  enable 

Description 

This  is  a  simple  dual  ported  RAM. 

6.4.6  Address  Generator  -  ict^ddr 

Module  I/O 


Name 

I/O 

Description 

mmm 

0 

Lower  three  bits  of  read  address. 

0 

Lower  three  bits  of  write  address. 

Description 

The  address  generator  provides  the  phit-within-a-flit  address  portion  of  the 
dual  ported  RAM’s  read  and  write  addresses. 
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6.4.7  Pointer  to  Address  Mapper  -  ict_pfmap 

Module  I/O 


Name 

I/O 

Description 

igyririnRMQW 

I 

Address  from  the  pointer  file. 

dpr^ddr[3..0] 

0 

Address  to  the  dual  ported  RAM. 

Description 

The  pointer  to  address  mapper  converts  the  0  to  63  range  of  pointers  to  a 
number  between  0  and  15  for  the  dual  ported  RAM.  The  function  is  0  maps 
to  0,  all  other  numbers  n  map  to  (n  mod  15)  +  1. 
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6.4.8  Unload  Controller  -  ict.unload 


Module  I/O 


Name 

I/O 

Description 

flits-tojsend[2..0] 

I 

These  inputs  from  the  pointer  file  indicate 
that  a  virtual  channel  queue  is  non-empty 
and  that  a  flit  should  be  forwarded. 

send^k[2..0] 

I 

These  inputs  from  the  pointer  file  indicate 
that  an  acknowledgment  for  a  virtual  chan¬ 
nel  should  be  sent  when  one  of  its  flits  is 
forwarded. 

make-replica 

0 

Forces  the  flit  being  forwarded  to  be  of  type 
tokenReplica. 

gen^eplica 

0 

When  the  tail  of  a  packet  is  lost,  no  token  will 
be  in  the  input  queue  and  one  must  be  gener¬ 
ated.  This  signal  forces  one  to  be  generated. 

pf-vc[2..0] 

0 

The  virtual  channel  sent  to  the  pointer  file 
to  obtain  its  current  read  pointer,  increment 
it,  back  it  up,  or  clear  it. 

dpr_vc[2..0] 

0 

Current  virtual  channel  number  for  the  dual 
ported  RAM.  This  is  simply  a  delayed  ver¬ 
sion  of  pf-vc. 

dprji(7..Q] 

I 

The  output  of  the  dual  ported  RAM  which 
the  unloader  needs  to  determine  the  flit’s 
kind. 

0 

Opcode  to  the  pointer  file. 

bid[2..0] 

0 

Desired  exit  port  out  of  the  crossbar. 

bid^cepted 

I 

The  crossbar  made  the  crosspoint. 

dst^ccepts 

I 

The  flit  was  accepted  by  the  output 
controller. 

dst.alarm 

I 

The  alarm  condition  at  the  output  controller. 

src_alarm 

I 

The  alarm  condition  in  the  input  loader. 
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Name 

I/O 

Description 

alaTms[4..0] 

I 

The  alarms  from  all  output  controllers,  used 
to  suppress  acknowledgments  as  late  as 
possible. 

I 

The  exit  port  to  use  for  a  virtual  channel. 

vcJbeing-routed  [2.  .0] 

I 

The  virtual  channel  the  routing  is  updating. 

set-route 

I 

The  router  established  the  route. 

retryjoute 

I 

The  router  was  unable  to  establish  the  route, 
but  the  packet  could  wait.  Try  again  later. 

0 

The  acknowledgment  to  send  out  through 
the  paired  output  controller. 

incr^ck 

0 

Increment  the  acknowledgment  count  in  the 
pointer  file. 

Description 

The  unload  controller  manages  most  of  the  state  associated  with  an  incoming 
virtual  channel.  The  state  information  is: 

vcjstate  Either  td/e,  waiting ForRoute,  or  active. 

vcJbacked-up  A  boolean  which  indicates  if  this  virtual  channel  had  had  to 
be  backed  up  and  rerouted. 

vc_to_port  The  target  exit  port. 

Initially,  the  state  for  all  virtual  channels  is  idle,  not  backed  up. 

When  the  loader  places  a  flit  into  the  input  buffer  for  a  virtual  channel, 
the  flits.tojsend  signal  for  that  virtual  channel  becomes  asserted,  Each  flit 
time,  the  unloader  scans  the  virtual  channels,  looking  for  a  virtual  channel 
with  flits  to  send  that  is  either  in  the  idle  or  active  states.  If  the  virtual 
channel  is  in  the  idle  state,  the  flit  is  sent  to  the  router  and  the  channel  state 
changed  to  waiting-for-route.  If  the  virtual  channel  was  in  the  active  state, 
the  flit  is  sent  to  the  destination  port  found  in  the  vc-to^ort  table. 

For  flits  of  type  head  and  data,  the  unloader  sends  the  flit  across  the 
crossbar  by  first  presenting  a  bid  to  the  crossbar.  If  the  bid  is  accepted  and 
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the  destination  accepts  the  flit,  the  read  pointer  is  incremented  in  the  pointer 
file.  This  has  the  side  effect  of  updating  the  flits_to^end  signal. 

For  flits  of  type  token,  the  situation  is  more  complex.  The  output  con¬ 
troller  will  refuse  the  flit  as  long  as  there  are  any  outstanding  unacknowledged 
flit  transmissions  for  this  virtual  channel.  This  is  to  ensure  that  all  the  flits 
in  the  packet  have  been  copied  to  two  places  before  freeing  up  the  input 
channel.  Thus,  when  the  input  controller  sends  a  flit  of  type  token  across 
the  crossbar,  the  state  associated  with  the  virtual  channel  is  initialized. 

For  all  kinds  of  flits,  if  the  destination  responds  with  an  alarm,  the  read 
pointer  in  the  pointer  file  is  backed  up,  the  channel  state  is  changed  to  idle, 
and  vc_backed_up  set.  This  eventually  causes  the  packet  or  packet  fragment 
to  be  re-routed.  When  the  token  for  the  packet  is  finally  transmitted,  the 
assertion  of  vcJbacked.up  causes  the  type  of  the  token  to  be  coerced  to  replica. 

The  unloader  only  notices  the  failure  of  the  physical  input  channel  when  a 
virtual  channel  is  in  the  active  state  and  there  are  no  more  flits  to  send.  The 
unloader  forces  the  generation  of  a  token  of  type  replica  and  then  initializes 
the  state  of  the  virtual  channel. 

Lastly,  the  unloader  generates  acknowledgments.  When  a  flit  is  success¬ 
fully  sent  to  an  output  controller,  the  virtual  channel  number  and  output 
controller  number  are  placed  in  a  3  deep  pipeline  which  is  shifted  every  flit 
time.  At  the  end  of  the  pipeline,  the  output  controller  is  checked  to  see  if  an 
alarm  has  been  set.  If  there  is  no  alarm  and  the  pointer  file  indicates  that 
an  acknowledgment  should  be  sent,  it  is  sent  and  the  acknowledgment  count 
in  the  pointer  file  incremented. 

Acknowledgments  for  flits  of  type  token  are  handled  slightly  differently, 
as  the  state  of  the  virtual  channel  is  initialized  as  soon  as  the  flit  made  it 
to  the  output  controller.  The  fact  that  it  was  of  type  token  is  also  placed  in 
the  3  deep  pipeline.  At  the  end  of  the  pipeline,  an  acknowledgment  is  always 
sent  when  the  type  token  indicator  appears. 
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6.4.9  Token  Generation  -  ict.token 


Module  I/O 


Name 

ihsi 

Description 

I 

The  flit  data  in. 

makejreplica 

I 

If  the  flit  is  of  type  tokenUnique,  coerce  it  to  be  a 
replica. 

gen_replica 

I 

Create  a  new  flit  of  type  tokenReplica. 

I 

The  virtual  channel  number  to  use  when  generating 
a  token. 

c^niH 

0 

The  flit  data  out. 

Description 

This  module  handles  the  coercion  and  the  generation  of  flits  to  type  token- 
RepUca. 
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6.5  Crossbar 

Module  I/O 


Name 

I/O 

Description 

d[4..0][7..0] 

I 

The  data  entering  the  crossbar 

0 

The  data  leaving  the  crossbar 

I 

The  bids  from  the  input  controllers 

0 

Indicates  to  the  output  controller  that  the 
data  appearing  is  V2did. 

bid^cepted[4..0] 

0 

Indicates  to  the  input  controller  that  a  cross- 
point  was  established. 

dst^cceptsJn[5..0] 

I 

The  signal  from  each  output  controller  which 
indicates  that  the  output  controller  accepted 
the  flit. 

dstjaccepts-out[4..0] 

0 

This  is  the  dst-acceptsJn  signal  brought  back 
over  the  crosspoint. 

EESSSSSEMam 

I 

The  destination’s  alarm. 

■■ 

0 

The  destination’s  alarm 

brought  back  through  the  crosspoint  to  the 
input  controller. 

Description 

This  module  is  composed  of  an  array  of  crosspoints  (xb_elt)  and  a  token  in¬ 
jector  (xbJnject).  Nearly  the  entire  crossbar  is  simply  the  two  dimensional 
array  of  crossbar  elements,  so  all  of  the  crosspoint  allocation  logic  is  dis¬ 
tributed  into  each  crosspoint.  This  makes  the  simulation  model  a  bit  hard 
to  understand,  but  allows  a  simpler  physical  realization. 

The  crossbar  is  organized  into  rows  and  columns.  Inputs  to  the  crossbar 
are  sent  along  the  rows,  outputs  are  taken  from  the  colunms. 
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6.5.1  Token  injector  -  xbjnject 

Module  I/O 


Name 

1/0 

Description 

latchjnew_state 

I 

A  once  per  flit  signal  used  to  change  the  state  of 
the  crossbar. 

0 

Only  one  of  these  will  be  asserted.  The  asserted 
token  is  used  to  determine  which  input  controller 
has  the  highest  priority  bid. 

Description 

In  order  to  ensure  fair  access  to  the  crossbar,  an  input  controller  is  selected 
at  random  to  have  the  highest  priority  bid. 
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6.5.2  Crossbar  crosspoints  -  xbarjelt 

Module  I/O 


Name 

I/O 

Description 

d[7..0] 

I 

The  row  data  to  be  moved  across  the  crosspoint. 

■qlO] 

0 

The  data  driven  onto  the  colunm. 

bid  2..0] 

I 

The  bid  from  the  input  controller. 

inject 

I 

The  initial  token  from  the  injector,  giving  this  row 
highest  priority. 

tokenjn 

I 

Priority  daisy>chain  in. 

tokenjout 

0 

Priority  daisy-chain  out. 

column  javailable 

I 

Column  was  not  allocated  by  the  router.  Not 
used. 

latch-new  jstate 

I 

Places  the  crosspoint  in  high  impedance  and  stores 
the  new  crosspoint  setting. 

column-valid 

0 

A  crosspoint  will  provide  valid  data  to  the  output 
controller  or  router  attached  to  this  column. 

bid-accepted 

0 

A  crosspoint  in  this  row  accepted  the  bid. 

dst-acceptsJn 

I 

The  flit  was  accepted  by  the  output  controller,  in. 

dst -accepts  jout 

0 

The  flit  was  accepted  by  the  output  controller, 
out. 

alarmJn 

I 

The  alarm  for  the  output  controller  attached  to 
this  row. 

alarm-out 

0 

The  alarm  sent  to  the  input  controller  from  an 
established  crosspoint. 

Description 

An  input  controller  broadcasts  a  bid  along  a  row  of  crosspoints.  Each  cross- 
I>oint  knows  its  column  number.  The  crosspoint  compares  the  bid  against 
its  column  number.  If  there  is  a  match,  that  crosspoint  is  responsible  for 
sending  status  information  back  to  the  input  controller. 

An  allocation  token  is  injected  along  one  row  of  the  crossbar  each  flit 
time.  When  a  crosspoint  has  the  allocation  token,  if  it  matched  the  input 
controller’s  bid,  it  establishes  the  crosspoint  and  asserts  column.valid.  If  the 
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crosspoint  did  not  match,  it  forwards  the  allocation  token  in  its  column. 

If  the  allocation  token  should  return  to  the  same  row  as  it  was  injected 
into,  that  crosspoint  deasserts  column_valid. 


6.6  Output  controller  -  oct.oct 


6.6.1  Module  I/O 


Name 

I/O 

Description 

I 

Data  from  the  crossbar. 

column.valid 

I 

An  indication  from  the  crossbar  that  crosspoint  to 
an  input  controller  has  been  established. 

send_alarm 

I 

The  alarm  indicator  from  the  paired  input 
controller. 

ack.to-send[2.  .0] 

I 

From  the  paired  input  controller,  this  indicates 
which  input  virtual  channel  has  successfully  repli¬ 
cated  a  flit  further  into  the  network  and  should 
now  be  acknowledged. 

_ 

I 

From  the  pdred  input  controller,  this  signal  indi¬ 
cates  the  availability  of  one  more  flit’s  worth  of 
storage  in  the  queue  across  the  physical  channel. 

pad 

0 

The  data  to  be  sent  to  the  output  pads. 

flit-valid 

0 

The  output  controller  asserts  this  signal  when  it 
accepts  the  flit. 

I 

From  the  router,  this  reserves  a  virtual  channel  at 
the  given  DR  level. 

I 

The  DR  number  of  the  packet  currently  being 
routed. 

free-vc 

0 

Asserted  when  the  output  controller  has  free  vir¬ 
tual  channeb. 

okayjto-wait 

0 

Asserted  when  the  the  DR  number  of  the  packet 
being  routed  is  less  than  or  equal  to  the  DR  num¬ 
ber  of  all  the  packets  assigned  to  virtual  channels. 

6.6.2  Description 


The  output  controller  is  made  of  two  submodules:  the  output  filter  and  the 
output  packagor. 

6.6.3  Output  filter  -  oct-filter 

Module  I/O 


Name 

I/O 

Description 

I 

Data  from  the  crossbar. 

colunm.valid 

I 

An  indication  from  the  crossbar  that  crosspoint  to 
an  input  controller  has  been  established. 

send-alarm 

I 

The  alarm  indicator  from  the  paired  input  controller. 

ack[2..0] 

I 

From  the  paired  input  controller,  this  signal  indi¬ 
cates  the  availability  of  one  more  flit’s  worth  of  stor¬ 
age  in  the  queue  across  the  physical  channel.  Cor¬ 
responds  the  ack-recvd  signal  in  the  surrounding 
module. 

q7..0] 

0 

Data  to  the  output  packager. 

flit-v^d 

0 

The  output  controller  asserts  this  signal  when  it  ac¬ 
cepts  the  flit. 

reserve_vc 

I 

From  the  router,  this  reserves  a  virtual  channel  at 
the  given  DR  level. 

I 

The  DR  number  of  the  packet  currently  being 
routed. 

free_vc 

0 

Asserted  when  the  output  controller  has  free  virtual 
channels. 

okay_to_wait 

0 

Asserted  when  the  the  DR  number  of  the  packet  be¬ 
ing  routed  is  less  than  or  equal  to  the  DR  number  of 
all  the  packets  assigned  to  virtual  channels. 
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Description 

The  output  filter  performs  three  functions.  It  implements  flow  control.  It 
maps  <input  port,  input  vc>  tuples  to  output  vc’s.  Finally,  it  provides 
output  virtual  channel  information  to  the  router. 

Flow  control  is  implemented  by  keeping  a  count  of  the  number  of  flits 
which  have  been  sent  on  an  output  virtual  channel  which  have  not  yet  be^ 
acknowledged.  When  the  count  exceeds  some  threshold,  the  output  filter  will 
stop  accepting  flits  from  the  crossbar  for  that  channel.  The  ack  input  will 
cause  the  count  for  that  virtual  channel  to  decrement. 

The  mapping  function  is  straightforward.  For  each  output  virtual  chan¬ 
nel,  its  associated  input  port  and  input  virtual  channel  are  remembered  in  a 
table.  When  a  flit  arrives,  the  table  is  searched  for  a  match  and  the  vc  field 
in  the  flit  updated. 

The  output  filter  remembers  the  DR  number  of  each  packet  associated 
with  an  active  output  virtual  channel.  When  the  router  is  determining  where 
to  send  a  packet,  it  first  broadcasts  the  packet’s  DR  number  to  all  the  output 
filters.  Each  filter  responds  to  the  router  with  an  indication  of  free  virtual 
channels.  If  there  are  none,  the  output  filter  must  tell  the  router  if  the  router 
is  able  to  wait  for  one  based  on  the  DR  number  presented  and  the  DR 
numbers  stored. 

When  the  router  determines  that  it  wants  a  packet  to  go  out  a  given 
output  port,  it  reserves  a  virtual  channel  in  the  output  filter.  This  is  to  allow 
the  output  filter  to  remember  the  DR  number  for  the  channel  in  time  for  the 
next  routing  decision.  If  the  output  filter  wuted  for  the  head  flit  to  arrive 
before  establishing  the  channel,  the  state  could  wind  up  being  inconsistent. 

When  the  head  flit  of  a  packet  arrives,  the  output  filter  looks  for  a  reserved 
virtual  channel  with  a  matching  DR  number  and  completes  the  map. 

The  output  filter  will  not  accept  a  flit  of  type  token  if  the  number  of 
unacknowledged  flits  is  non-zero.  Once  the  flit  is  accepted,  the  output  virtual 
channel  is  marked  as  reclaimNextAck.  The  next  acknowledgment  for  that 
virtual  channel  will  cause  the  channel  to  be  freed. 
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6.6.4  Output  packager  -  oct_package 

Module  I/O 


Name 

I/O 

Description 

d  7..0] 

I 

Data  from  the  output  filter. 

send_alarm 

I 

The  alarm  indicator  from  the  paired  input  controller. 

ack[2..0] 

I 

From  the  paired  input  controller,  this  indicates  which 
input  virtual  channel  has  successfully  replicated  a  flit 
further  into  the  network  and  should  now  be  acknowl¬ 
edged.  This  corresponds  to  ack_to^end  in  the  sur¬ 
rounding  module. 

0 

The  data  to  be  sent  to  the  output  pads. 

Description 

The  output  packager  splices  the  acknowledgment  and  alarm  fields  into  the 
outgoing  flit.  It  then  recomputes  the  CRC  field  for  the  flit.  Finally,  it 
computes  parity  for  the  first  and  last  phits  of  the  flit. 
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6.7  Router 


6.7.1  Module  I/O 


Name 

I/O 

Description 

I 

The  X  coordinate  of  the  switching  element. 

our.y  5..0 

I 

The  y  coordinate  of  the  switching  element. 

ourjz  5..0 

I 

The  z  coordinate  of  the  switching  element. 

0 

The  DR  number  of  the  packet  currently  be¬ 
ing  routed. 

okay_to_wait[4..0] 

I 

From  each  output  controller,  these  are  as¬ 
serted  when  the  the  DR  number  of  the 
packet  being  routed  is  less  than  or  equal  to 
the  DR  number  of  all  the  packets  assigned 
to  virtual  channels. 

free_vcsJFromjoct[4..0] 

I 

From  each  output  controller,  these  are  as¬ 
serted  when  the  output  controller  has  free 
virtual  channels. 

0 

This  reserves  a  virtual  channel  in  a  partic¬ 
ular  output  controller. 

ql7..0] 

0 

Data  input  to  the  crossbar.  Currently  not 
used. 

bid[2..0] 

0 

Output  port  select  to  the  crossbar.  Cur¬ 
rently  not  used. 

bid^cepted 

I 

Not  used. 

dstjaccepts 

I 

Not  used. 

d(7..0] 

I 

The  data  lines  from  the  crossbar  carrying 
the  head  flit  of  the  packet  to  be  routed. 

column.valid 

I 

An  indication  from  the  crossbar  that  a 
crosspoint  from  an  input  controller  has  been 
established. 

dst^cceptsjout 

0 

Currently  always  asserted,  it  indicates  that 
router  will  attempt  to  route  this  packet. 

dst^larm 

I 

Always  deasserted  and  currently  ignored. 
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Name 

I/O 

Description 

alarm^tub 

0 

Always  deasserted,  as  the  router  cannot 
have  an  alarm. 

0 

The  exit  port  fro  the  packet  being  routed. 

vc_being-routed  [2.  .0] 

0 

The  input  virtual  channel  of  the  packet  be¬ 
ing  routed. 

set_route[4..0] 

0 

Causes  the  input  controller  to  record  the 
routing  information  and  begin  forwarding 
the  packet. 

retry  joute[4..0] 

0 

Causes  the  input  controller  to  retry  the 
routing  the  packet  later. 

2Llarm  4..0] 

I 

The  alarm  conditions  of  all  the  ports. 

6.7.2  Description 

WheD  an  input  controller  decides  that  a  packet  needs  to  be  routed,  it  send 
the  head  flit  to  the  router.  The  router  parses  the  flit  into  destination  X,Y,Z 
coordinates  and  DR  number.  It  extracts  the  input  port  number  and  input 
virtual  channel  number.  It  then  attempts  to  And  an  output  port. 

To  And  the  output  port,  it  first  computes  the  desired  exit  port,  using 
dimension  ordered  routing.  If  the  desired  port  is  either  non-functional  or  if 
exiting  the  desired  port  would  result  in  a  backtrack,  a  new  port  is  chosen 
at  random.  If  the  exit  port  has  no  free  virtual  channels  and  the  packet  is 
unable  to  wait  owing  to  dimensional  reversal  constraints,  a  new  port  is  again 
chosen. 

Finally,  if  the  desired  exit  port  has  a  free  virtual  channel,  the  router 
reserves  the  channel  and  communicates  the  exit  port  to  the  input  controller. 
If  the  exit  port  had  no  free  channels,  the  router  tells  the  input  controller  to 
retry  the  route  request. 
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6.8  Summary 


In  this  chapter,  we  stated  the  goals  of  the  architecture,  namely  to  show  that 
the  unique  token  protocol  was  realizable  using  standard  cell  technology.  De¬ 
sign  guidelines  in  support  of  the  goals  were  developed,  describing  clocking, 
intermodule  communication,  and  circuit-design  style.  The  microarchitecture 
was  then  presented  at  the  top  level  and  successively  refined.  This  microar¬ 
chitecture  has  been  implemented  as  a  register-transfer  model  using  the  M 
modeling  language.  The  verification  of  this  model  will  be  discussed  in  the 
next  chapter. 
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Chapter  7 

Design  Verification 


We  have  developed  a  new  protocol,  the  unique  token  protocol,  for  reliably 
moving  data  through  a  network.  To  show  that  the  protocol  is  realizable, 
a  switching-element  microarchitecture  was  designed  and  presented  in  the 
previous  chapter.  This  microarchitecture  has  been  implemented  as  a  register- 
transfer  level  (RTL)  model  using  the  M  modeling  language^.  In  this  chapter, 
we  describe  the  validation  of  this  model  and  the  subsequent  observations. 

The  goal  of  the  verification  is  to  ensure  that  the  model  functions  as  a 
router  for  a  two  dimensional  mesh  network  and  that  the  implementation 
delivers  packets  according  to  the  unique  token  protocol,  under  both  fault- 
free  and  faulted  conditions. 

7.1  Verification  Procedure 

The  verification  process  was  done  in  four  stages  and,  as  bugs  were  found  and 
corrected,  repeated  several  times  until  the  model  was  verified.  The  first  stage 
is  a  single  chip  test.  The  original  purpose  of  this  test  was  to  aid  in  bringing 
up  a  single  device,  along  with  the  various  traffic  sources  and  sinks.  Once 
the  design  stabilized,  this  test  evolved  into  a  confidence  test  of  the  switching 
element  by  injecting  random  traffic  for  an  extended  period  of  time.  This 
stage  of  testing  does  not  not  involve  any  fault  recovery. 

^Mentor  Graphics,  Silicon  Design  Division 
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The  second  stage  of  testing  connects  two  devices  together.  The  purpose 
is  to  expose  the  discrepancies  between  the  artificial  traffic  generated  by  the 
test  fixture  and  the  actual  traffic  between  two  switching  elements. 

In  the  third  stage  of  testing,  a  2  x  2  mesh  is  constructed  and  traffic  gen¬ 
erators  attached  to  all  spare  ports.  This  provided  a  slightly  longer  path  for 
messages  to  travel  and  served  as  a  base  case  for  the  fourth  stage. 

The  final  stage  of  testing  is  agdn  a  2  x  2  mesh.  Traffic  is  sourced  only 
from  a  single  processor  port  and  is  sent  to  the  processor  port  on  the  opposite 
comer  of  the  mesh.  Since  the  router  first  attempts  dimension  ordered  routing, 
this  traffic  would  travel  along  a  preferred  route.  The  first  link  along  this  route 
is  sabotaged  and  the  protocol  allowed  to  adapt  to  the  fault  and  recover.  The 
timing  of  the  fault  is  changed  from  run  to  run. 


7.2  Test  Jigs  and  Scaffolding 

Support  code  falls  into  three  classes:  port  stubs,  fixtures,  and  post-mortem 
sanity  checks. 

A  port  stub  provides  a  source  of  traffic  to  send  into  a  port  on  a  chip.  It  is 
capable  of  injecting  a  variable  number  of  packets  into  all  the  virtual  channels 
within  a  port.  The  length  of  the  packet  is  chosen  at  random  between  some 
easily  changed  lower  and  upper  bounds.  The  packet  consists  of  a  head  flit,  a 
data  flit  containing  a  unique  identifier,  some  number  of  data  flits  containing 
random  data,  and  a  token  flit. 

The  destination  address  contzuned  in  each  packet  may  be  bounded  by 
min/max  X,Y,Z  pairs  independently  in  each  port  stub.  The  port  stub  will  not 
generate  traffic  to  its  local  processor  port.  It  also  receives  and  acknowledges 
flits  to  allow  the  flow  control  algorithm  to  work.  Without  such  support,  the 
chip  model  hangs  after  it  has  sent  a  few  flits. 

All  flits  sent  and  received  are  written  in  human-readable  form  into  a  log 
file,  to  make  debugging  easier.  The  format  is  also  readily  parsed  by  computer 
programs  for  more  automated  verification. 

Three  major  and  several  minor  variants  of  the  test  fixture  have  been 
written.  The  first  fixture  is  a  single  chip  with  port  stubs  attached  at  all  five 
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locations.  This  fixture  was  heavily  used  throughout  the  course  of  writing  and 
debugging  the  model.  The  second  fixture  has  two  chip  models  interconnected 
with  port  stubs  at  all  other  open  locations. 

The  third  fixture  implements  a  2  x  2  mesh.  This  fixture  was  used  mainly 
to  verify  the  reliability  protocol.  The  stubs  are  attached  to  the  processor 
ports  of  the  chips.  Two  variants  have  been  constructed,  one  with  stubs  at 
the  edges  of  the  mesh,  the  other  with  the  edge  ports  grounded.  This  fixture 
is  capable  of  injecting  a  fault  along  one  the  links.  The  fault  time  may  be 
varied  at  run  time  via  a  parameter  on  the  simulator  command  line.  This 
makes  it  possible  to  write  a  shell  script  to  make  repeated  simulator  runs, 
stepping  through  fault  times. 

A  program,  sane,  has  been  written  to  verify  the  correct  operation  of  the 
switching  element.  It  first  compiles  a  list  of  packets  which  have  been  injected, 
then  compiles  the  list  of  packets  extracted.  It  discards  runt  packets,  those 
that  consist  only  of  a  head  flit  and  a  token.  It  splices  together  packets  from 
the  packet  fragments  arriving  tagged  as  type  replica.  Finally,  it  compares 
the  packets  sent  against  the  packets  received  for  length  and  contents.  If  all 
packets  sent  have  a  corresponding  packet  received  and  vice-versa,  then  the 
test  has  been  passed. 


7.3  Testing  Results 

Three  different  configurations  were  tested  under  normal  operation.  A  single 
chip  was  injected  with  720  packets,  two  chips  with  9  packets,  and  the  2x2 
mesh  with  640  packets.  All  configurations  passed. 

A  2  X  2  mesh  was  used  as  the  basis  for  the  fourth  test.  The  source 
of  traffic  was  the  processor  port  on  the  lower  left.  It  sent  six  messages  of 
random  lengths  to  the  processor  at  the  upper  right.  Since  the  router  is 
primarily  dimension  ordered,  the  usual  path  includes  the  switching  element 
on  the  lower  right.  At  some  programmable  time,  the  wires  leading  from  the 
node  on  the  lower  left  to  the  node  on  the  lower  right  were  forced  to  all  zeros. 

This  basic  test  was  run  from  within  a  shell  script  which  stepped  the  failure 
time  from  time  zero  until  a  time  after  all  six  packets  had  been  received.  After 
each  pass,  the  sanity  checker  was  run  and  the  results  written  into  a  file.  All 
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runs  passed  the  sanity  checker. 


7.4  Summary 

The  register  transfer  model  of  the  reliable  switching  element  has  been  simu¬ 
lated  in  a  variety  of  configurations  and  stresses.  It  has  shown  correct  func¬ 
tionality  under  these  conditions.  From  this,  we  conclude  that  the  model 
constructed  is  a  faithful  implementation  of  the  unique  token  protocol. 
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Chapter  8 
Conclusions 


This  thesis  presented  a  new  protocc'  the  unique  token  protocol,  for  reliable 
conununication  using  replication  in  a  network.  It  contrasted  the  scalability 
of  this  protocol  with  existing  end-to-end  protocols.  It  described  the  organiza¬ 
tion  and  microarchitecture  of  a  switching  element  implementing  the  protocol. 
Finally,  it  presented  the  results  of  the  design  verification  simulations.  This 
work  has  led  to  several  conclusions  concerning  the  protocol  and  possible  im¬ 
plementations. 

The  unique  token  protocol  can  be  used  in  a  wide  variety  of  network  topolo¬ 
gies.  It  consumes  less  network  bandwidth  than  end-to-end  protocols  when 
acknowledgments  are  taken  into  account.  It  offers  excellent  scalability,  with 
storage  requirements  dominated  by  0{n)  terms,  where  n  is  the  number  of 
processors. 

What  the  protocol  does  under  some  boundary  conditions  is  not  well  de¬ 
fined  in  this  thesis.  For  example,  the  recovery  procedure  to  use  when  the 
destination  fails  and  the  message  is  undeliverable  is  not  given,  as  it  has 
strong  interactions  with  the  method  of  adaptive  routing  chosen.  One  has  to 
decide  that  the  destination  is  truly  unreachable  and  not  that  the  bulk  of  the 
physical  channels  leading  into  the  final  switching  element  are  down.  This  is 
currently  handled  by  delivering  the  packet  to  the  nearest  processor  when  the 
packet’s  dimension  reversal  number  overflows.  Software  will  then  be  used 
to  probe  for  alternate  routes  or  return  the  packet  if  none  exist.  A  future 
research  topic  is  to  provide  semantics  for  undeliverable  packets. 
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The  implementation  provided  sevaal  insights.  The  key  one  is  that  the 
protocol  is  indeed  implementable  without  a  dramatic  increase  in  logic  com¬ 
plexity  over  an  adaptive,  deadlock-free  switching  element.  Indeed,  the  changes 
are  confined  mainly  to  the  input  controller. 

The  second  insight  is  that  containment  of  network  errors  can  lead  to 
increased  latency.  The  switching  element  has  state  which  is  altered  by  in¬ 
coming  packets.  The  model,  as  constructed,  did  not  allow  any  critical  state 
to  be  altered  until  the  incoming  data  could  be  certified  error-free.  Thus,  the 
progress  of  the  data  is  slowed  by  the  error  checking  process.  This  increased 
latency  may  be  dealt  with  by  providing  wider  data  paths,  which  collapses 
the  time  from  start  of  flit  until  verification.  Secondly,  one  could  checkpoint 
the  critical  state  and  backtrack  upon  error  detection.  The  amount  of  criti¬ 
cal  state  is  heavily  determined  by  the  network  topology  and  the  complexity 
of  mechanisms  required  for  adaptive  routing  and  deadlock  avoidance.  Some 
topologies,  such  as  indirect  connection  networks,  are  intrinsically  deadlock 
free  and  may  have  less  complex  checkpointing  circuits. 

A  second  cause  of  latency  in  the  model  is  the  use  of  a  centralized  router. 
By  centralizing  the  router,  several  input  controllers  contend  for  its  use.  If  the 
typical  packet  size  is  large,  this  contention  will  be  minimized.  The  central¬ 
ization  was  done  for  practical  reasons  in  order  to  avoid  the  construction  of 
some  form  of  distributed  router.  The  problem  is  that  the  router  not  only  de¬ 
cides  where  to  send  the  packet,  it  must  also  allocate  resources  in  the  form  of 
channel  bandwidth.  Distributing  the  resource  allocation  decision  is  possible 
through  the  introduction  of  a  second  crossbar  array.  This  second  crossbar 
would  allow  an  input  controller  to  lock  the  output  controller  and  allocate  a 
virtual  channel  without  interference  from  other  input  controllers. 
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Appendix  A 

Buffer  Storage  Requirements 
of  End-to-End  Protocols 


End'to-end  protocols  require  storage  in  the  source.  We  need  to  understand 
how  that  storage  requirement  changes  with  machine  size.  Consider  a  node 
in  the  middle  of  a  very  long  chain  of  n  +  1  nodes  as  shown  in  figure  A.l. 
Each  physical  channel  has  bandwidth  u;,  where  u  has  units  of  msgs/sec.  The 
system  has  an  aggregate  bandwidth  of  no;,  measured  in  channehmsgs/sec. 
When  this  aggregate  bandwidth  is  evenly  distributed  over  the  nodes,  each 
node  may  use  u  channel-msgs/sec. 

Let  A  be  a  probability  distribution  function  (PDF)  describing  the  likeli¬ 
hood  that  a  node  wishes  to  send  a  message  to  another  node,  passing  through 
d  channels,  (d  =  0  implies  no  message  sent.) 

Vd,d>0,  A(d)  =  arf  (A.l) 

Since  it  is  a  PDF,  we  know  that: 


=  *  (A-2) 

dsO 

Each  node  would  be  using  system  bandwidth  equal  to: 

OO 

udad  (A.3) 

d=0 
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Figure  A.l:  Simple  Traffic  Model 

which  must  be  equal  to  w  under  uniform  maximum  loads.  We  can  then  write: 

53  dflrf  =  1  (A.4) 

dsO 


Let  the  intemode  delay  on  a  channel  be  e.  Let  7  be  the  time  it  takes  a 
node  to  receive  a  message  and  generate  an  acknowledgment.  The  total  round 
trip  for  a  message  and  its  acknowledgment  to  a  node  d  channels  away  is: 

2d€  +  7  (A.5) 

The  expected  amount  of  storage  required  to  hold  copies  of  messages  sent 
to  a  node  d  channels  away  is  simply  the  probability  of  sending  a  message  d 
channels  away  x  channel  bandwidth  x  duration: 

adu;{2de  +  7)  (A.6) 

The  total  expected  storage  is  then: 

00  00  00 

53  adw(2de  +  7)  =  2u;€  53  +  ‘*;7  53  (A-7) 

dsO  dsO  dsO 

Using  equations  A.2  and  A.4,  this  reduces  to: 

u;(2c  +  7)  (A.8) 
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The  expected  amount  of  storage  used  to  buffer  messages  is  constant, 
irrespective  of  the  actual  PDFs  used  to  model  the  traffic  sources.  Note  that 
we  have  only  calculated  the  expected  amount  of  storage  used  under  uniform 
conditions.  When  a  node  can  consume  a  disproportionately  large  share  of 
the  system  bandwidth,  its  expected  storage  requirements  increase.  This  can 
be  seen  mathematically  by  introducing  a  load  factor  into  equation  A.4: 

f^dai  =  <t,  (A.9) 

4=0 

which  results  in  expected  storage  of: 

ij{2e<f>  +  -f)  (A.IO) 

Intuitively,  we  can  bound  ^  by  considering  a  network  where  only  one  node 
is  sending  traffic,  and  that  is  to  a  destination  node  over  the  longest  possible 
path.  For  2D  mesh  networks,  these  nodes  are  on  opposite  comers  of  the 
machine.  <f>  is  then  simply  the  number  of  hops  needed  to  get  from  source  to 
destination.  For  such  a  mesh  network,  we  have; 

^  =  2(v^-l)  (A.ll) 

Substituting  A.ll  into  A.IO  yields  an  expected  storage  requirement  of: 

u;(4e(>/n  -  1) +  7)  (A.12) 

We  see  that  the  expected  storage  requirements  in  2D  meshes  for  end-to- 
end  protocols  grow  as 
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